Introduction

The goal of this file is to lay out proof-of-concept analyses of the Strongyloides venezuelensis RNAseq dataset originally published by Hunt et al 2018.

Samples included in this dataset were prepared using different library construction methods (amplified vs non-amplified), sequencing run batches, and machines (Hunt et al 2018). All samples are filed under the same study accession number, PRJDB3457, but they have different SRA study numbers. Dividing the experiments by sequencing instrument produces two batches, both of which contain data from free-living females (FLF) and therefore theoretically permit batch correction. However, following limma-based batch correction there were still substantial differences between the FLF groups from the two batches. We therefore take the conservative approach of treating these two batches separately.
Thus, we define two functional groups for processing and analysis:
1. Group FLF_PF: Free-living females and parasitic females (samples DRR106346 - DRR106357; aka SAMD00096905-SAMD00096910; SRA Study: DRP002629). This set includes 3 biological replicates and two technical replicates per life stage.
2. Group iL3_extended: Egg, L1, iL3s, activated iL3s (1 and 5 day), iL3_lung, Young_FLF, FLF (samples DRR029282, DRR029433 - DRR029445; SRA Study: ). This set includes 1 biological replicate and two technical replicates per life stage, except for activated iL3s, which have a single technical replicate at 1 and 5 days.

Here, only the first group (FLF_PF) is analyzed, because the lack of biological replicates in Group iL3_extended would require a significant adjustment to the analysis pipeline. Specifically, empirical Bayes smoothing of gene-wise standard deviations, which provides increased power, is not possible without biological replicates.

Data Pre-Processing

A full description of Kallisto alignment and data filtering/normalization steps can be found in Sv_RNAseq_Data_Preprocessing.rmd.

Data Analysis

The limma package (Ritchie et al 2015, Phipson et al 2016) is used to conduct pairwise differential gene expression analyses between life stages. The results of the pairwise comparison are displayed as a volcano plot and an interactive DataTable.

Code

All code is echoed under descriptive headers; code chunks are hidden from view by default. Users may show hidden R code by clicking the Show buttons. In addition, all code chunks are collated at the end of the document in an Appendix.

Load and Parse Preprocessed Data

This code loads R data objects that have been preprocessed by Sv_RNAseq_Data_Preprocessing.rmd.

# Load and Parse Preprocessed Data
load (file = "../Outputs/SvRNAseq_group_FLF_PF_data_preprocessed")
targets <- SvRNAseq.preprocessed.data$targets
annotations <- SvRNAseq.preprocessed.data$annotations
log2.cpm.filtered.norm <- SvRNAseq.preprocessed.data$log2.cpm.filtered.norm
myDGEList.filtered.norm <-SvRNAseq.preprocessed.data$myDGEList.filtered.norm

rm(SvRNAseq.preprocessed.data)

load(file = "../Outputs/Sv_vDGEList")

# Check for presence of output plots folder, generate if it doesn't exist
output.path <- "../Outputs/Plots"
if (!dir.exists(output.path)){
  dir.create(output.path)
}
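
As a minimal sketch of how load() behaves in the chunk above, the following base R example saves and reloads a small stand-in object. The contents are toy values; only the object and element names are taken from the code above, and the real file may contain more elements.

```r
# Hypothetical stand-in for the preprocessed object; element names follow the
# chunk above, contents are toy values.
SvRNAseq.preprocessed.data <- list(
  targets = data.frame(sample = c("s1", "s2"), group = c("FLF", "PF")),
  annotations = data.frame(Description = c("a", "b"),
                           row.names = c("g1", "g2")),
  log2.cpm.filtered.norm = matrix(c(1, 2, 3, 4), nrow = 2,
                                  dimnames = list(c("g1", "g2"),
                                                  c("s1", "s2")))
)

# Save to a temporary file, then remove the object from the workspace
tmp <- tempfile()
save(SvRNAseq.preprocessed.data, file = tmp)
rm(SvRNAseq.preprocessed.data)

# load() restores objects under their saved names and invisibly returns those
# names; loading into a fresh environment avoids overwriting the workspace.
env <- new.env()
loaded.names <- load(tmp, envir = env)
targets <- env$SvRNAseq.preprocessed.data$targets
print(loaded.names)
```

Because load() restores objects under the names they were saved with, the analysis chunk can refer to SvRNAseq.preprocessed.data directly after loading.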

Hierarchical Clustering and Principal Components Analysis

This code chunk starts with filtered and normalized abundance data in a data frame (not tidy). It implements hierarchical clustering and PCA on the data, then plots several graphs, including a dendrogram of the hierarchical clustering and several plots to visualize the PCA. Because the data passed into these analyses do not have batch correction applied, the clustering appears dominated by a batch effect.

# Introduction to this chunk -----------
# This code chunk starts with filtered and normalized abundance data in a data frame (not tidy).
# It will implement hierarchical clustering and PCA analyses on the data.
# It will plot various graphs and can save them in PDF files.
# Load packages ------
suppressPackageStartupMessages({
  library(tidyverse) # data wrangling (dplyr, tidyr, tibble)
  library(ggplot2)
  library(RColorBrewer)
  library(ggdendro)
  library(magrittr)
  library(factoextra)
  library(gridExtra)
  library(cowplot)
  library(dendextend)
})

# Identify variables of interest in study design file ----
group <- factor(targets$group)
batch <- factor(targets$batch)
source <- factor(targets$source)

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# i.e. how similar the samples are to each other
colnames(log2.cpm.filtered.norm)<-paste(targets$group,substr(targets$source, 10,12), sep = ".")
distance <- dist(t(log2.cpm.filtered.norm), method = "maximum") #other distance methods are "euclidean", "manhattan", "canberra", "binary" or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 

p1<-dend %>% 
  dendextend::set("branches_k_color", k = 6) %>% 
  dendextend::set("hang_leaves", c(0.05)) %>% 
  dendextend::set("labels_cex", c(0.5)) %>%
  dendextend::set("labels_colors", k = 6) %>% 
  dendextend::set("branches_lwd", c(0.7)) %>% 
  
  as.ggdend %>%
  ggplot (offset_labels = -0.2) +
  theme_dendro() +
  ylim(0, max(get_branches_heights(dend))) +
  labs(title = "S. venezuelensis: Hierarchical Cluster Dendrogram",
       subtitle = "filtered, TMM normalized, group FLF/PF",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))

Dendrogram to visualize hierarchical clustering

Clustering performed on filtered and normalized abundance data using the “complete” method.
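
The two steps used for the dendrogram (a distance matrix between samples, then agglomerative clustering) can be sketched in base R as follows. The matrix is made up for illustration; the gene names, sample names, and values are hypothetical and not taken from the dataset.

```r
# Toy log2CPM-like matrix: rows = genes, columns = samples (hypothetical values).
# matrix() fills column by column, so each group of five values is one sample.
toy <- matrix(c(1.0, 2.0, 3.0, 4.0, 5.0,
                1.1, 2.1, 2.9, 4.2, 5.0,
                5.0, 4.0, 3.0, 2.0, 1.0,
                5.2, 3.9, 3.1, 2.0, 1.1),
              nrow = 5,
              dimnames = list(paste0("gene", 1:5),
                              c("A.1", "A.2", "B.1", "B.2")))

# dist() compares rows, so transpose to compare samples
toy.dist <- dist(t(toy), method = "maximum")

# Complete linkage, as in the chunk above
toy.clusters <- hclust(toy.dist, method = "complete")

# Cut the tree into two groups; samples sharing a prefix land together
toy.groups <- cutree(toy.clusters, k = 2)
print(toy.groups)
```

With method = "maximum", the distance between two samples is the largest absolute difference across all genes, so a single strongly divergent gene can drive the clustering.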

# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(log2.cpm.filtered.norm), scale.=F, retx=T)
#summary(pca.res) # Prints variance summary for all principal components.

#pca.res$rotation # 'rotation' shows how much each gene influenced each PC (the variable 'loadings')
#pca.res$x # 'x' holds the coordinates of each sample on each PC (the sample 'scores')
#(the chunks below keep the column names 'loadings' for pca.res$x and 'scores' for pca.res$rotation)
#note that these have a magnitude and a direction (this is the basis for making a PCA plot)
## This generates a scree plot: a standard way to view eigenvalues for each PC. Shows the proportion of variance accounted for by each PC. Plotting only the first 10 dimensions.
p2<-fviz_eig(pca.res,
             barcolor = brewer.pal(8,"Pastel2")[8],
             barfill = brewer.pal(8,"Pastel2")[8],
             linecolor = "black",
             main = "Scree plot: proportion of variance accounted for by each principal component",
             ggtheme = theme_bw()) 

Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalues for each PC. The plot shows the proportion of variance accounted for by each PC.
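
As a small illustration of where the scree plot values come from, the sketch below runs prcomp() on a random toy matrix and converts the eigenvalues into percentages. The matrix dimensions and names are arbitrary assumptions for the example.

```r
set.seed(1)
# Toy matrix: 50 genes (rows) x 6 samples (columns); values are random
toy <- matrix(rnorm(50 * 6), nrow = 50,
              dimnames = list(paste0("gene", 1:50),
                              paste0("sample", 1:6)))

# Samples must be rows for prcomp(), hence the transpose
toy.pca <- prcomp(t(toy), scale. = FALSE, retx = TRUE)

# Eigenvalues are the squared standard deviations of the PCs
toy.var <- toy.pca$sdev^2
toy.per <- round(toy.var / sum(toy.var) * 100, 1)

print(dim(toy.pca$x))        # samples x PCs
print(dim(toy.pca$rotation)) # genes x PCs
print(toy.per)               # percent variance per PC, in decreasing order
```

The number of PCs is limited by the smaller dimension of the input, which here is the number of samples.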

pc.var<-pca.res$sdev^2 # sdev^2 captures these eigenvalues from the PCA result
pc.per<-round(pc.var/sum(pc.var)*100, 1) # we can then use these eigenvalues to calculate the percentage variance explained by each PC

# Visualize the PCA result ------------------
#let's first plot any two PCs against each other
#pca.res$x gives the position of each sample along each PC, so let's plot
pca.res.df <- as_tibble(pca.res$x)

# Plotting PC1 and PC2
p3<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2, label=targets$group, 
      fill = targets$group,
      color = targets$group
  ) +
  geom_point(size=4, shape= 21, color = "black", alpha = 0.5) +
  #geom_label(color = "black", size = 2) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2", guide = "none") +
  #stat_ellipse() +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="S. venezuelensis: Principal Components Analysis of RNAseq Samples",
       caption = "Note: analysis is blind to life stage identity.",
       subtitle ="RNAseq Dataset: Group FLF_PF",
       fill = "Life Stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw()+
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

suppressMessages(ggsave("Sv_Multivariate_Plots_PCA.pdf",
       plot = p3,
       device = "pdf",
       height = 4,
       #width = 7,
       path = output.path))

PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

# Create a PCA 'small multiples' chart ----
pca.res.df <- pca.res$x[,1:3] %>% 
  as_tibble() %>%
  add_column(sample = targets$sample,
             source = source,
             group = group,
             batch = factor(targets$batch))

pca.pivot <- pivot_longer(pca.res.df, # dataframe to be pivoted
                          cols = PC1:PC3, # column names to be stored as a SINGLE variable
                          names_to = "PC", # name of that new variable (column)
                          values_to = "loadings") # name of new variable (column) storing all the values (data)
PC1<-subset(pca.pivot, PC == "PC1")
PC2 <-subset(pca.pivot, PC == "PC2")
#PC3 <- subset(pca.pivot, PC == "PC3")
#PC4 <- subset(pca.pivot, PC == "PC4")

# New facet label names for PCs
PC.labs <- c(paste0("PC1 (",pc.per[1],"%",")"),
             paste0("PC2 (",pc.per[2],"%",")"),
             paste0("PC3 (",pc.per[3],"%",")")
             )
names(PC.labs) <- c("PC1", "PC2", "PC3")

p6<-ggplot(pca.pivot) +
  aes(x=sample, y=loadings) + # you could iteratively 'paint' different covariates onto this plot using the 'fill' aes
  geom_bar(stat="identity", aes(fill = group)) +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~PC, labeller = labeller(PC = PC.labs)) +
  #geom_bar(data = PC1, stat = "identity", aes(fill = group)) +
  #geom_bar(data = PC2, stat = "identity", aes(fill = source)) +
  labs(title="S. venezuelensis: PCA 'small multiples' plot",
       fill = "Life Stage",
       subtitle ="RNAseq Dataset: Group FLF_PF") +
  scale_x_discrete(limits = targets$sample, labels = targets$source) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10)) +
  coord_flip()

suppressMessages(ggsave("Sv_Multivariate_Plots_Small_Multiples.pdf",
       plot = p6,
       device = "pdf",
       height = 4,
       width = 8,
       path = output.path))

PCA “Small Multiples” Plot

Genes Contributing to PC Identity

This chunk provides additional analysis of the principal components to determine which genes influence the identified PCs. It prints an annotated list of genes that are the top 10% of contributors (in either direction) to PC1 and PC2.

# Introduction to this chunk ----
# This chunk provides additional analysis of the principal components, in order to determine which genes are influencing the identified PCs.

# Use pca.res$rotation to select genes influencing PC1-6 ----
myscores.df <- pca.res$rotation[,1:6] %>% 
  as_tibble(rownames = "geneID") %>%
  pivot_longer(cols = -geneID, names_to = "PC", values_to = "scores") %>%
  dplyr::mutate(abs_scores = abs(scores)) %>%
  group_by(PC) %>%
  slice_max(abs_scores, prop = .1) # get top 10% of genes in all PCs

# Pull out genes that are the top 10% of contributors (in any direction) to PC1 and PC2. Annotate.
myscores.Top10 <- myscores.df %>%
  dplyr::filter(PC == "PC1" | PC == "PC2") %>%
  dplyr::select(!abs_scores) %>%
  dplyr::arrange(desc(scores), .by_group = T) %>%
  dplyr::left_join(.,(rownames_to_column(annotations, var = "geneID")), by = "geneID") %>%
  dplyr::relocate(UniProtKB, Description, InterPro, GO_term, Ce_geneID, Ce_percent_homology, .after = scores)


# Make interactive table
myscores.Top10.interactive <- myscores.Top10 %>%
  DT::datatable(extensions = c('KeyTable', "FixedHeader", "Buttons", "RowGroup"),
                rownames = FALSE,
                caption = htmltools::tags$caption(
                  style = 'caption-side: top; text-align: left;',
                  htmltools::tags$b('Top 10% of Genes Contributing to S. venezuelensis PC1 and PC2')),
                options = list(keys = TRUE,
                               dom = 'Bfrtip',
                               rowGroup = list(dataSrc = 1),
                               buttons = c('csv', 'excel'),
                               autoWidth = TRUE,
                               scrollX = TRUE,
                               scrollY = '300px',
                               searchHighlight = TRUE, 
                               pageLength = 10, 
                               lengthMenu = c("10", "25", "50", "100"))) %>%
  DT::formatRound(columns=c(3), digits=3)

myscores.Top10.interactive
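
The selection of top contributors can be sketched in base R as below. This is an illustration on a random, hypothetical rotation matrix, and it assumes the 10% cutoff is rounded down and that ties are not an issue; dplyr::slice_max() keeps ties by default, so results may differ slightly when values tie.

```r
set.seed(2)
# Hypothetical rotation matrix: 50 genes x 2 PCs
toy.rotation <- matrix(rnorm(100), nrow = 50,
                       dimnames = list(paste0("gene", 1:50),
                                       c("PC1", "PC2")))

# For each PC, keep the genes whose absolute value ranks in the top 10%
top.fraction <- 0.1
n.keep <- floor(nrow(toy.rotation) * top.fraction)

top.genes <- lapply(colnames(toy.rotation), function(pc) {
  values <- toy.rotation[, pc]
  keep <- order(abs(values), decreasing = TRUE)[seq_len(n.keep)]
  # Report in descending signed order, as the chunk above does with arrange()
  sort(values[keep], decreasing = TRUE)
})
names(top.genes) <- colnames(toy.rotation)
print(top.genes$PC1)
```

Ranking on the absolute value captures genes that pull a PC in either direction, while the signed value is kept for display.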

Heatmap of Gene Expression Across Life Stages

Make a heatmap for all the genes using the Log2CPM values.

suppressPackageStartupMessages({
library(pheatmap)
library(RColorBrewer)
library(heatmaply)
})
# Make a heatmap for all the genes using the Log2CPM values

diffGenes <- v.DEGList.filtered.norm$E %>%
  as_tibble(rownames = "geneID", .name_repair = "unique") %>%
  dplyr::select(!geneID) %>%
  as.matrix()
## Loading required package: limma
rownames(diffGenes) <- rownames(v.DEGList.filtered.norm$E)
colnames(diffGenes) <- as.character(v.DEGList.filtered.norm$targets$source)
clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete")
clustRows <- hclust(as.dist(1-cor(t(diffGenes),
                                  method="pearson")),
                    method="complete")
par(cex.main=1.2)

showticklabels <- c(TRUE,FALSE)
p<-pheatmap(diffGenes,
            color = RdBu(75),
            cluster_rows = clustRows,
            cluster_cols = clustColumns,
            show_rownames = F,
            scale = "row",
            angle_col = 45,
            main = "Sv: Log2 Counts Per Million (CPM) Expression Across Life Stages (Group FLF_PF)"

)
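
The correlation-based distance used to cluster the heatmap rows and columns can be illustrated on a toy matrix. The values below are hypothetical, and the row scaling shown at the end is assumed to correspond to what scale = "row" does in pheatmap (centering each gene and dividing by its standard deviation).

```r
# Toy expression matrix: rows = genes, columns = samples (hypothetical values)
toy <- cbind(s1 = c(1, 2, 3, 4, 5),
             s2 = c(3, 5, 7, 9, 11),  # same ranking as s1
             s3 = c(5, 4, 3, 2, 1))   # reversed ranking

# Correlation runs from -1 to 1, so 1 - cor runs from 0 (identical pattern)
# to 2 (opposite pattern)
toy.cor.dist <- as.dist(1 - cor(toy, method = "spearman"))
print(toy.cor.dist)

# Row-wise z-scores (assumed equivalent to scale = "row"): scale() works on
# columns, so transpose before and after
toy.z <- t(scale(t(toy)))
print(round(rowMeans(toy.z), 10))
```

Spearman correlation depends only on ranks, so s1 and s2 are at distance 0 even though their absolute values differ.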

Differentially Expressed Genes

This chunk uses a variance-stabilized DGEList of filtered and normalized abundance data. These data/results are examples; a responsive version of this code is available in a Shiny App.

# Introduction to this chunk ----
# Because we have access to biological and technical replicates, we can use statistical tools for differential expression analysis
# Useful reading on differential expression: https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html

# Load packages ----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma) # differential gene expression using linear modeling
  library(edgeR)
  library(gt) 
  library(DT) 
  library(plotly)
  library(ggthemes)
  library(RColorBrewer)
  source("./theme_Publication.R")
})

diffGenes.df <- v.DEGList.filtered.norm$E %>%
  as_tibble(rownames = "geneID", .name_repair = "unique")

# Set Expression threshold values for plotting and saving DEGs ----
adj.P.thresh <- 0.05
lfc.thresh <- 1 

group <- factor(v.DEGList.filtered.norm$targets$group)
block <- factor (targets$batch)
design <- model.matrix(~0 + group) # no intercept/blocking for matrix, comparisons across group
colnames(design) <- levels(group)


# Fit a linear model to the data ----
fit <- lmFit(v.DEGList.filtered.norm, design = design)

# As an example, generate comparison matrix for a pairwise comparison ----
# PF vs FLF
# Note that each side of the contrast is divided by its number of life 
# stage groups, e.g. (PF+FLF)/2 - (iL3+iL3a+pfL1+ppL1+ppL3)/5
comparison <- c('(PF)-(FLF)')

targetStage<- comparison %>%
  str_split(pattern="-", simplify = T) %>%
  .[,1] %>%
  gsub("(", "", ., fixed = TRUE) %>%
  gsub(")", "", ., fixed = TRUE) %>%
  str_split(pattern = "\\+", simplify = T)

contrastStage<-comparison %>%
  str_split(pattern="-", simplify = T) %>%
  .[,2] %>%
  gsub("(", "", ., fixed = TRUE) %>%
  gsub(")", "", ., fixed = TRUE)  %>%
  str_split(pattern = "\\+", simplify = T)

comparison<- sapply(seq_along(comparison),function(x){
  tS <- as.vector(targetStage[x,]) %>%
    .[. != ""] 
  cS <- as.vector(contrastStage[x,]) %>%
    .[. != ""] 
  paste(paste0(tS, 
               collapse = "+") %>%
          paste0("(",.,")/",length(tS)),
        paste0(cS, 
               collapse = "+") %>%
          paste0("(",.,")/",length(cS)),
        sep = "-")
  
})

# Generate contrast matrix ----
contrast.matrix <- makeContrasts(contrasts = comparison,
                                 levels=design)

# extract the linear model fit -----
fits <- contrasts.fit(fit, contrast.matrix)
# empirical bayes smoothing of gene-wise standard deviations provides increased power (see: https://www.degruyter.com/doi/10.2202/1544-6115.1027)
ebFit <- eBayes(fits)

# Pull out the DEGs that pass a specific threshold for all pairwise comparisons ----
# Adjust for multiple comparisons using method = global. 
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value = adj.P.thresh)

recode01<- function(x){
  case_when(x == 1 ~ "Up",
            x == -1 ~ "Down",
            x == 0 ~ "NotSig")
}
diffDesc <- results %>%
  as_tibble(rownames = "geneID") %>%
  dplyr::mutate(across(-geneID, unclass)) %>%
  dplyr::mutate(across(where(is.double), recode01))

# Function that identifies top DEGs between a specific contrast ----
calc_DEG_tbl <- function (ebFit, coef) {
  myTopHits.df <- limma::topTable(ebFit, adjust ="BH", 
                                  coef=coef, number=40000, 
                                  sort.by="logFC") %>%
    as_tibble(rownames = "geneID") %>%
    dplyr::rename(tStatistic = t, LogOdds = B, BH.adj.P.Val = adj.P.Val) %>%
    dplyr::relocate(UniProtKB, Description, InterPro, GO_term, 
                    In.subclade_geneID, In.subclade_percent_homology,
                    Out.subclade_geneID, Out.subclade_percent_homology,
                    Ce_geneID, Ce_percent_homology, .after = LogOdds)
  
  myTopHits.df
}

list.myTopHits.df <- sapply(comparison, function(y){
  calc_DEG_tbl(ebFit, y)}, 
  simplify = FALSE, 
  USE.NAMES = TRUE)

list.myTopHits.df <- sapply(comparison, function(y){
  list.myTopHits.df[[y]] %>%
    dplyr::select(geneID, 
                  logFC, 
                  BH.adj.P.Val:Ce_percent_homology)},
  simplify = FALSE, 
  USE.NAMES = TRUE)

# Get log2CPM values and threshold information for genes of interest
list.myTopHits.df <- sapply(seq_along(comparison), function(y){
  tS<- targetStage[y,][targetStage[y,]!=""]
  cS<- contrastStage[y,][contrastStage[y,]!=""]
  
  concat_name <- function(x) {
    ifelse(x == "target", 
           paste(tS, collapse = "+"), 
           paste(cS, collapse = "+"))
  }
  
  groupAvgs <- diffGenes.df %>%
    dplyr::select(geneID, starts_with(paste0(tS,"-")), 
                  starts_with(paste0(cS,"-"))) %>%
    pivot_longer(cols = -geneID, names_to = c("group","sample"), values_to = "CPM",
                 names_sep = "-") %>%
    dplyr::mutate(contrastID = if_else(group %in% tS,"target", "contrast")) %>%
    group_by(geneID, contrastID) %>%
    dplyr::select(-sample) %>%
    summarize(mean = mean(CPM), .groups = "drop_last") %>%
    pivot_wider(names_from = contrastID, values_from = mean) %>%
    dplyr::relocate(contrast, .after = target) %>%
    dplyr::rename_with(concat_name, -geneID) %>%
    dplyr::rename_with(.cols =-geneID, .fn = ~ paste0("avg_(",.x,")"))
  
  diffGenes.df %>%
    dplyr::select(geneID, starts_with(paste0(tS,"-")), 
                  starts_with(paste0(cS,"-"))) %>%
    left_join(groupAvgs, by = "geneID") %>%
    left_join(list.myTopHits.df[[y]],., by = "geneID") %>%
    left_join(dplyr::select(diffDesc,geneID,comparison[y]), by = "geneID") %>%
    dplyr::rename(DEG_Desc=comparison[y]) %>%
    dplyr::relocate(DEG_Desc) %>%
    dplyr::relocate(logFC:Ce_percent_homology, .after = last_col())
  
},
simplify = FALSE)

comparison <- gsub("/[0-9]*","", comparison)
names(list.myTopHits.df) <- comparison

list.myTopHits.df <- sapply(comparison, function(y){
  list.myTopHits.df[[y]] %>%
    dplyr::mutate(DEG_Desc = case_when(DEG_Desc == "Up" ~ paste0("Up in ", str_split(y,'-',simplify = T)[1,1]),
                                       DEG_Desc == "Down" ~ paste0("Down in ", str_split(y,'-',simplify = T)[1,1]),
                                       DEG_Desc == "NotSig" ~ "NotSig")) 
},
simplify = FALSE, 
USE.NAMES = TRUE)

# Volcano Plot and Interactive Table ----
vplot1 <- ggplot(list.myTopHits.df[[1]]) +
  aes(y=-log10(BH.adj.P.Val), x=logFC, text = paste(geneID, "<br>",
                                                    "logFC:", round(logFC, digits = 2), "<br>",
                                                    "adj. p-val:", format(BH.adj.P.Val, digits = 3, scientific = TRUE))) +
  geom_point(size=2) +
  geom_hline(yintercept = -log10(adj.P.thresh), 
             linetype="longdash", 
             colour="grey", 
             size=1) + 
  geom_vline(xintercept = lfc.thresh, 
             linetype="longdash", 
             colour="#BE684D", 
             size=1) +
  geom_vline(xintercept = -lfc.thresh, 
             linetype="longdash", 
             colour="#2C467A", 
             size=1) +
  labs(title = paste0('S. venezuelensis: Pairwise Comparison: ',
                      gsub('-',
                           ' vs ',
                           comparison[1])),
       subtitle = paste0("grey line: p = ",
                         adj.P.thresh, "; colored lines: log-fold change = ", lfc.thresh),
       color = "GeneIDs") +
  theme_Publication() 
vplot1
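
The thresholds drawn on the volcano plot can be illustrated with a base R sketch. The p-values and log-fold changes below are made up, and combining the adjusted p-value cutoff with the log-fold change cutoff is an assumption made for illustration: decideTests() as called above applies only the p-value threshold.

```r
# Hypothetical raw p-values and log2 fold changes for five genes
p.raw <- c(gene1 = 0.001, gene2 = 0.01, gene3 = 0.02,
           gene4 = 0.045, gene5 = 0.5)
logFC <- c(gene1 = 2.5, gene2 = -1.8, gene3 = 0.3,
           gene4 = 3.0, gene5 = -2.2)

adj.P.thresh <- 0.05
lfc.thresh <- 1

# Benjamini-Hochberg adjustment, the same method named in the chunk above
adj.p <- p.adjust(p.raw, method = "BH")

# Classify each gene against both thresholds (illustrative combination)
status <- ifelse(adj.p < adj.P.thresh & logFC >= lfc.thresh, "Up",
          ifelse(adj.p < adj.P.thresh & logFC <= -lfc.thresh, "Down",
                 "NotSig"))
print(data.frame(adj.p = round(adj.p, 4), logFC = logFC, status = status))
```

In this toy set, gene3 passes the p-value cutoff but not the fold-change cutoff, and gene4 has a large fold change but fails the adjusted p-value cutoff.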

# Interactive Tables
yy<- 1
tS<- targetStage[yy,][targetStage[yy,]!=""]
cS<- contrastStage[yy,][contrastStage[yy,]!=""]
sample.num.tS <- sapply(tS, function(x) {colSums(v.DEGList.filtered.norm$design)[[x]]}) %>% sum()
sample.num.cS <- sapply(cS, function(x) {colSums(v.DEGList.filtered.norm$design)[[x]]}) %>% sum()


n_num_cols <- sample.num.tS + sample.num.cS + 5
index_homologs <- length(colnames(list.myTopHits.df[[yy]])) - 5

LS.datatable <- list.myTopHits.df[[yy]] %>%
  DT::datatable(rownames = FALSE,
                caption = htmltools::tags$caption(
                  style = 'caption-side: top; text-align: left; color: black',
                  htmltools::tags$b('Differentially Expressed Genes in', 
                                    htmltools::tags$em('S. venezuelensis'), 
                                    gsub('-',' vs ',comparison[yy])),
                  htmltools::tags$br(),
                  "Threshold: p < ",
                  adj.P.thresh, "; log-fold change > ",
                  lfc.thresh,
                  htmltools::tags$br(),
                  'Values = log2 counts per million'),
                options = list(autoWidth = TRUE,
                               scrollX = TRUE,
                               scrollY = '300px',
                               scrollCollapse = TRUE,
                               order = list(n_num_cols-1, 
                                            'desc'),
                               searchHighlight = TRUE, 
                               pageLength = 25, 
                               lengthMenu = c("5",
                                              "10",
                                              "25",
                                              "50",
                                              "100"),
                               columnDefs = list(
                                 # list(
                                 #   targets = ((n_num_cols+1)),
                                 #   render = JS(
                                 #     "function(data, row) {",
                                 #     "data.toExponential(1);",
                                 #     "}")
                                 # ),
                                 list(
                                   targets = ((n_num_cols + 
                                                 4):(n_num_cols + 
                                                       5)),
                                   render = JS(
                                     "function(data, type, row, meta) {",
                                     "return type === 'display' && data.length > 20 ?",
                                     "'<span title=\"' + data + '\">' + data.substr(0, 20) + '...</span>' : data;",
                                     "}")
                                 ),
                                 list(targets = "_all",
                                      class="dt-right")
                               ),
                               rowCallback = JS(c(
                                 "function(row, data){",
                                 "  for(var i=0; i<data.length; i++){",
                                 "    if(data[i] === null){",
                                 "      $('td:eq('+i+')', row).html('NA')",
                                 "        .css({'color': 'rgb(151,151,151)', 'font-style': 'italic'});",
                                 "    }",
                                 "  }",
                                 "}"  
                               ))
                               
                )) 
LS.datatable <- LS.datatable %>%
  DT::formatRound(columns=c(3:n_num_cols), 
                  digits=3)

LS.datatable <- LS.datatable %>%
  DT::formatRound(columns=c(n_num_cols+2, 
                            index_homologs+1,
                            index_homologs+3), 
                  digits=2)

LS.datatable <- LS.datatable %>%
  DT::formatSignif(columns=c(n_num_cols+1), 
                   digits=3)

LS.datatable
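
The contrast string assembled earlier in this chunk divides each side of the comparison by its number of life stages, so that multi-stage groups are averaged. A minimal base R sketch of that string construction is shown below; build_contrast is a hypothetical helper name, not a function from limma or from the original code, and like the original it assumes stage names contain no "-" characters.

```r
# Hypothetical helper: turn "(A+B)-(C)" into "(A+B)/2-(C)/1"
build_contrast <- function(comparison) {
  sides <- strsplit(comparison, "-", fixed = TRUE)[[1]]
  stopifnot(length(sides) == 2)

  format_side <- function(side) {
    # Drop parentheses, then split the side into its life stages
    stages <- strsplit(gsub("[()]", "", side), "+", fixed = TRUE)[[1]]
    stages <- stages[stages != ""]
    paste0("(", paste(stages, collapse = "+"), ")/", length(stages))
  }

  paste(format_side(sides[1]), format_side(sides[2]), sep = "-")
}

print(build_contrast("(PF)-(FLF)"))
print(build_contrast("(A+B)-(C)"))
```

The resulting string is in the form that the chunk above passes to makeContrasts() through its contrasts argument.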

Benchmarking

This section compares our results to a published analysis. Here, we use Supplementary Table 1 from Hunt et al 2018, which includes the results of an edgeR differential gene expression analysis between free-living females and parasitic females. Note that we manually adjust the geneID names in Table S1 to match current WormBase ParaSite nomenclature, such that ‘SVE_’ is used instead of ‘SVEN_’.

suppressPackageStartupMessages({
  library(openxlsx)
  library(tidyverse)
  library(ggplot2)
})
# Load Hunt Dataset: PF vs FLF comparison
temp.dat <-  read.xlsx ("../Data/Benchmarking/41598_2018_23514_MOESM2_ESM.xlsx", 
                        sheet = 1, startRow = 4)

Hunt.dat <- tibble(geneID = temp.dat$QUERY_GENE, logFC = temp.dat$logFC)
Hunt.dat <- Hunt.dat[complete.cases(Hunt.dat),]
rm(temp.dat)

# Rename Results of PF vs FLF comparison from Browser
Browser.dat <- list.myTopHits.df$`(PF)-(FLF)` %>%
  dplyr::select(geneID, logFC)

print(paste('Total number of genes in Hunt *et al* 2018 PF vs FLF comparison tab:',nrow(Hunt.dat)))
## [1] "Total number of genes in Hunt *et al* 2018 PF vs FLF comparison tab: 7817"
print(paste('Total number of genes in Str-RNAseq Browser PF vs FLF output file:', nrow(Browser.dat))) 
## [1] "Total number of genes in Str-RNAseq Browser PF vs FLF output file: 12291"
# The plot below takes the genes with LogFC results in both the Browser and Hunt databases, and plots the two sets against each other. 
plotting.all <- inner_join(Browser.dat, Hunt.dat, by = "geneID")

linearMod <- lm(logFC.y ~ logFC.x, data = plotting.all) %>%
  summary()

p.benchmark <- ggplot(plotting.all, aes(x = logFC.x, y = logFC.y)) +
  geom_smooth(method=lm, color = 'red', formula = "y ~ x") +
  geom_point(shape=16, size=3, alpha = 0.8) +
  labs(title = "S. venezuelensis: Str-Browser vs Hunt Data",
       subtitle = "Group: FLF_PF; comparison = PF vs FLF",
       caption = paste("points = genes; red line/shading = linear regression \n",
                       "w/ 95% confidence regions (formula = y ~ x). \n",
                       "Adj R-squared =",
                       round(linearMod$adj.r.squared,3)),
       x = "Str-Browser LogFC",
       y = "Hunt et al 2018 LogFC") +
  coord_equal() +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

print("Linear regression of Browser vs Hunt LogFC results:")
## [1] "Linear regression of Browser vs Hunt LogFC results:"
(linearMod)
## 
## Call:
## lm(formula = logFC.y ~ logFC.x, data = plotting.all)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1338 -0.0553  0.0377  0.0899  5.7993 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.057050   0.003539  -16.12   <2e-16 ***
## logFC.x      1.008550   0.002146  469.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3128 on 7809 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9658 
## F-statistic: 2.209e+05 on 1 and 7809 DF,  p-value: < 2.2e-16
suppressMessages(ggsave("Sv_Benchmarking.pdf",
       plot = p.benchmark,
       device = "pdf",
       height = 4,
       #width = 8,
       path = output.path))

p.benchmark 
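
The benchmarking steps (harmonize gene IDs, keep genes present in both tables, regress one set of log-fold changes on the other) can be sketched in base R as follows. The gene IDs and values are hypothetical, and the prefix substitution is only one possible programmatic equivalent of the manual renaming described above.

```r
# Hypothetical results from this pipeline (current 'SVE_' prefix)
browser.toy <- data.frame(
  geneID = c("SVE_0001", "SVE_0002", "SVE_0003", "SVE_0004"),
  logFC = c(1.0, 2.0, 3.0, 4.0))

# Hypothetical published table (older 'SVEN_' prefix)
published.toy <- data.frame(
  geneID = c("SVEN_0001", "SVEN_0002", "SVEN_0003", "SVEN_0005"),
  logFC = c(1.2, 1.9, 3.1, 9.0))

# Harmonize the prefix so the IDs can be matched
published.toy$geneID <- sub("^SVEN_", "SVE_", published.toy$geneID)

# merge() keeps only genes present in both tables (an inner join) and
# suffixes the shared column as logFC.x / logFC.y
both <- merge(browser.toy, published.toy, by = "geneID")

fit <- lm(logFC.y ~ logFC.x, data = both)
adj.r2 <- summary(fit)$adj.r.squared
print(both)
print(round(adj.r2, 3))
```

Genes found in only one table (SVE_0004 and SVE_0005 in this toy case) drop out of the join, which is why the joined table can be smaller than either input.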

# Introduction to this chunk ----
# this chunk creates heatmaps from differentially expressed genes;
# it takes as input a list of genes that are differentially expressed in any life stage
# It selects modules of co-expressed genes based on pearson correlations
# 
# These data/results are examples of possible analyses that can be run on this data.

# Load packages -----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma)
  library(RColorBrewer)
  library(gplots)
  library(heatmaply)
  library(ggplot2)
  library(egg)
  library(dendextend)
  source("./ggheatmap_local.R")
})

# Choose a color palette ----
#myheatcolors <- rev(brewer.pal(name="RdBu", n=11))
myheatcolors <- RdBu(75)

# Select the comparison
y = 1

# Generate variable containing expression data for the thresholded DEGs 
diffGenes.thresh <- v.DEGList.filtered.norm$E[results[,y] !=0,]


# Cluster DEGs across stages ----
#begin by clustering the genes (rows) for a list of genes that are differentially expressed in at least one life stage
# use the 'cor' function and the pearson method for finding all pairwise correlations of genes
# '1-cor' converts this to a 0-2 scale for each of these correlations, which can then be used to calculate a distance matrix using 'as.dist'
clustRows <- hclust(as.dist(1-cor(t(diffGenes.thresh), method="pearson")), method="complete") 
# hierarchical clustering is a type of unsupervised clustering. 
# NOTE: this clustering may give different results from one based on log2.cpm.filtered.norm data, likely b/c this version is specifically focused on genes that are significantly different between conditions.
# Related methods include K-means, SOM, etc 
# unsupervised methods are blind to sample/group identity
# in contrast, supervised methods 'train' on a set of labeled data.  
# supervised clustering methods include random forest, and artificial neural networks

# cluster samples (columns)
clustColumns <- hclust(as.dist(1-cor(diffGenes.thresh, method="spearman")), method="complete") #cluster columns by spearman correlation
#note: use Spearman, instead of Pearson, for clustering samples because it gives equal weight to highly vs lowly expressed transcripts or genes

#Cut the resulting tree and create color vector for clusters.  
module.assign <- stats::cutree(clustRows, k=8) #k = 8 modules; diffGenes.thresh holds the DEGs from the single comparison selected above (y = 1)

# assign a color to each module (makes it easy to identify and manipulate)
module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9) 
module.color <- module.color[as.vector(module.assign)] 

# # simplify heatmap by averaging the biological replicates and display only one column per condition
# diffGenes.AVG <- avearrays(diffGenes.thresh)

# plot the hclust results as a heatmap, grouping the life stages together
diffGenes.heatmap <- heatmap.2(diffGenes.thresh,
                               srtCol = 0, adjCol= c(0.5,0.5),
                               Rowv=as.dendrogram(clustRows),
                               Colv=as.dendrogram(clustColumns),
                               key.title = NA,
                               main = paste0("DEG Heatmap (by life stage): "),
                               sub = paste0("Genes pass threshold in >= 1 comparison. Threshold: p < ",
                                            adj.P.thresh, "; log-fold change > ",
                                            lfc.thresh),
                               RowSideColors=module.color,
                               col=rev(myheatcolors), scale='row', labRow=NA,
                               density.info="none", trace="none",
                               cexRow=1, cexCol=1)

## GGPlots version
# gg.diffGenes.heatmap<-ggheatmap_local(diffGenes.thresh,
#                    colors = rev(myheatcolors),
#                    Rowv= ladderize(as.dendrogram(clustRows)),
#                    Colv=ladderize(as.dendrogram(clustColumns)),
#                    key.title = "Log2CPM",
#                    branches_lwd = 0.2,
#                    showticklabels = c(TRUE, FALSE),
#                    scale='row',
#                    cexRow=1, cexCol=1)

# ggsave("./heatmap.pdf", plot = gg.diffGenes.heatmap, width = 11, height = 8, units = "in", device = "pdf")
# Make an interactive version
# interactive.diffGenes.heatmap <- heatmaply(diffGenes.thresh,
#                                  colors = rev(myheatcolors),
#                                  Rowv= ladderize(as.dendrogram(clustRows)),
#                                  Colv=ladderize(as.dendrogram(clustColumns)),
#                                  showticklabels = c(TRUE, FALSE),
#                                  scale='row',
#                                  plot_method = "ggplot",
#                                  branches_lwd = 0.2,
#                                  key.title = "Log2CPM",
#                                  cexRow=1, cexCol=1)

Functional Enrichment Analysis

This code performs GSEA using the clusterProfiler library. The ability to do this depends on the availability of gene sets. Major databases (e.g. msigdb) don’t seem to have Strongyloides information. They do have C. elegans gene sets, but I’m not convinced the homology information is good enough for the comparison to be unbiased/meaningful. Hunt et al 2016 provide an Ensembl Compara protein family set; we will use this as the basis for our gene set libraries.
Note that this set uses transcript-specific identifiers; I discard the transcript suffix (e.g. SSTP_0001137400.2 is recoded as SSTP_0001137400).
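
As a minimal sketch of the recoding described above, the transcript suffix can be removed with a regular expression. The helper name `strip_transcript_suffix` and the example vector `ids` are hypothetical and are not part of the report's code; the report's own chunk also strips a trailing lowercase letter.

```r
# Hypothetical helper: drop a trailing ".<digits>" transcript suffix from an ID
strip_transcript_suffix <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# Example IDs: the first is the one quoted in the text, the second has no suffix
ids <- c("SSTP_0001137400.2", "SSTP_0001137400")
recoded <- strip_transcript_suffix(ids)
print(recoded)
```

IDs without a suffix pass through unchanged, so the helper can be applied to a mixed column.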

Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
There are three key elements of the GSEA method:
Calculation of an Enrichment Score.
The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter a gene not in S. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic (Subramanian et al. 2005).
Estimation of Significance Level of ES.
The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
Adjustment for Multiple Hypothesis Testing.
When all gene sets have been evaluated, DOSE adjusts the estimated significance levels to account for multiple hypothesis testing; q-values are also calculated for FDR control.
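
The first two elements above can be sketched in base R. This is an illustrative toy implementation under stated assumptions, not the algorithm clusterProfiler runs internally: the function names `enrichment_score` and `permutation_pvalue`, the simulated statistics, and the same-sign p-value convention are all choices made for this example.

```r
# Weighted running-sum enrichment score for one gene set (illustrative only).
# stats: named numeric vector sorted in decreasing order (the ranked list L)
# set:   character vector of gene IDs (the gene set S)
# p:     weighting exponent (p = 1 weights each hit by |statistic|)
enrichment_score <- function(stats, set, p = 1) {
  in_set <- names(stats) %in% set
  hit_w  <- abs(stats)^p * in_set
  # step up at genes in S (weighted), step down at genes outside S (uniform)
  step    <- ifelse(in_set, hit_w / sum(hit_w), -1 / sum(!in_set))
  running <- cumsum(step)
  running[which.max(abs(running))]  # maximum deviation from zero
}

# Permutation null: shuffle the gene labels and recompute the ES each time.
# Assumption: the p-value is taken relative to null scores of the same sign.
permutation_pvalue <- function(stats, set, n_perm = 1000, p = 1) {
  observed <- enrichment_score(stats, set, p)
  null_es <- replicate(n_perm, {
    shuffled <- stats
    names(shuffled) <- sample(names(stats))
    enrichment_score(shuffled, set, p)
  })
  same_sign <- null_es[sign(null_es) == sign(observed)]
  pval <- (sum(abs(same_sign) >= abs(observed)) + 1) / (length(same_sign) + 1)
  list(ES = observed, p.value = pval)
}

# Simulated ranked list and a toy gene set concentrated at the top of the list
set.seed(1)
toy_stats <- sort(setNames(rnorm(200), paste0("gene", 1:200)), decreasing = TRUE)
toy_set   <- names(toy_stats)[1:10]
res <- permutation_pvalue(toy_stats, toy_set, n_perm = 500)
print(res)
```

The third element, multiple-testing adjustment, would then be applied across the p-values of all gene sets, for example with `p.adjust(pvals, method = "BH")`.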

# Load packages ----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma)
  library(openxlsx)
  library(gplots) #for heatmaps
  library(DT) #interactive and searchable tables of our GSEA results
  library(GSEABase) #functions and methods for Gene Set Enrichment Analysis
  library(Biobase) #base functions for bioconductor; required by GSEABase
  library(GSVA) #Gene Set Variation Analysis, a non-parametric and unsupervised method for estimating variation of gene set enrichment across samples.
  library(gprofiler2) #tools for accessing the GO enrichment results using g:Profiler web resources
  library(clusterProfiler) # provides a suite of tools for functional enrichment analysis
  library(msigdbr) # access to msigdb collections directly within R
  library(enrichplot) # great for making the standard GSEA enrichment plots
})
# Pick a pairwise comparison
yy <- 1

# Carry out GO enrichment using gProfiler2 ----
# GO enrichment requires a pre-selected set of genes. Can use multiple criteria to do that initial selection.
# The GO terms I'm accessing using the gost are from Hunt et al 2016, I believe.

# # PC1 TopTable Results
# enriched.set.pos <-list.myTopHits.df[[yy]] %>% 
#     slice_max(logFC, prop = .1) # get top 10% of genes
# 
# enriched.set.neg <- list.myTopHits.df[[yy]] %>% 
#     slice_min(logFC, prop = .1) # get bottom 10% of genes
# 
# gost.res.pos <- gost(list(Target_Upregulated = enriched.set.pos$geneID), organism = "ststerprjeb528", correction_method = "fdr")
# gostplot(gost.res.pos, interactive = T, capped = T)
# 
# gost.res.neg <- gost(list(Target_Downregulated_Genes = enriched.set.neg$geneID), organism = "ststerprjeb528", correction_method = "fdr")
# gostplot(gost.res.neg, interactive = T, capped = T)

# Perform GSEA using clusterProfiler ----
# Which library to use for implementation? As per https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz158/5722384: "For expression-based EA on the full expression matrix...When given raw read counts, we recommend to apply a VST such as voom [39] to arrive at library-size normalized logCPMs."
# For testing self-contained null hypothesis (test for association of any gene in the set with the phenotype), use ROAST
# For testing competitive null hypothesis (test for excess of differential expression in a gene set relative to genes outside the set) - **their recommendation**, use PADOG or SAFE?
# 
# Ability to do this depends on the availability of gene sets. Major databases (e.g. msigdb) don't seem to have Strongyloides information. They do have C. elegans gene sets, but I'm not convinced the homology information is good enough for the comparison to be unbiased/meaningful. 
# 

# In Hunt et al 2016, there is an Ensembl Compara protein family set
# Note that this uses specific transcript information, which I throw out. 
# (e.g. SSTP_0001137400.2 is recoded as SSTP_0001137400)
ensComp.geneIDs <- read.xlsx ("../Data/Hunt_Parasite_Ensembl_Compara.xlsx", 
                              sheet = 1) %>%
  as_tibble() %>%
  dplyr::select(-Family.members) %>%
  pivot_longer(cols = -Compara.family.id, values_to = "geneID") %>%
  dplyr::select(-name) %>%
  dplyr::filter(grepl("SVE_", geneID))

ensComp.geneIDs$geneID <- str_remove_all(ensComp.geneIDs$geneID, "\\.[0-9]$")
ensComp.geneIDs$geneID <- str_remove_all(ensComp.geneIDs$geneID, "[a-z]$")

# Compare these genes to the list of genes in our filtered, normalized list ----
# 
compara.exclusive <- unique(ensComp.geneIDs$geneID) %>%
  as_tibble_col(column_name = "geneID") %>%
  dplyr::anti_join(diffGenes.df, by = "geneID")
paste('Number of genes exclusive to the Ensembl Compara List: ',nrow(compara.exclusive))
## [1] "Number of genes exclusive to the Ensembl Compara List:  3284"
compara.absent <- unique(ensComp.geneIDs$geneID) %>%
  as_tibble_col(column_name = "geneID") %>%
  dplyr::anti_join(diffGenes.df,., by = "geneID") %>%
  dplyr::select(geneID)
paste('Number of genes exclusive to the RNAseq Gene List: ',nrow(compara.absent))
## [1] "Number of genes exclusive to the RNAseq Gene List:  1391"
# How many genes have associated GO terms? ----
GO.present <- list.myTopHits.df[[yy]]$GO_term %>%
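  gsub("NA", NA,.) %>% # caution: this sets any entry containing "NA" (including e.g. "DNA") to NA; the pattern "^NA$" would match only the literal string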
  as_tibble_col(column_name = "GO_Term") %>%
  tibble(geneID = list.myTopHits.df[[yy]]$geneID,.) %>%
  dplyr::filter(!is.na(GO_Term))
paste('Number of genes with an associated GO term: ',nrow(GO.present))
## [1] "Number of genes with an associated GO term:  6991"
# Are any of these genes part of those not found in the compara dataset? ---- 
GO.present.Compara.absent <- dplyr::semi_join(GO.present, compara.absent, by = "geneID")
paste('Number of genes with GO terms that are not found in the Ensembl Compara List: ',nrow(GO.present.Compara.absent))
## [1] "Number of genes with GO terms that are not found in the Ensembl Compara List:  535"
# Load gene family descriptions and build the gene set table (family description + gene ID)
ensComp.familyIDs <- read.xlsx ("../Data/Hunt_Parasite_Ensembl_Compara.xlsx", 
                                sheet = 2,
                                cols = c(1,4:6)) %>%
  as_tibble() %>%
  dplyr::mutate(Family_Description = dplyr::coalesce(.$Description, 
                                                     .$`Top.product.(members.with.hit)`, 
                                                     .$`Interpro.top.hit.(members.with.hit)`)
  ) %>%
  dplyr::select(Compara.family.id, Family_Description)

ensComp <- left_join(ensComp.geneIDs, ensComp.familyIDs, by = "Compara.family.id") %>%
  dplyr::select(-Compara.family.id) %>%
  dplyr::rename(gs_name = Family_Description, gene_symbol = geneID) %>%
  dplyr::relocate(gs_name, gene_symbol)

rm(ensComp.geneIDs, ensComp.familyIDs)

# Filter out genes that aren't part of our RNAseq dataset
genelist <- v.DEGList.filtered.norm$genes %>%
  rownames_to_column(var = "geneID") %>%
  dplyr::select(geneID)
ensComp<- ensComp %>%
  dplyr::rename(geneID = gene_symbol) %>%
  left_join(genelist, ., by = "geneID") %>%
  dplyr::relocate(gs_name, geneID)


# Generate rank ordered list of genes ----
mydata.df.sub <- dplyr::select(list.myTopHits.df[[yy]], geneID, logFC)
mydata.gsea <- mydata.df.sub$logFC
names(mydata.gsea) <- as.character(mydata.df.sub$geneID)
mydata.gsea <- sort(mydata.gsea, decreasing = TRUE)

# run GSEA using the 'GSEA' function from clusterProfiler
# Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
# There are three key elements of the GSEA method:
# **Calculation of an Enrichment Score.**
# The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter a gene not in S. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic (Subramanian et al. 2005).
# **Estimation of Significance Level of ES.**
# The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
# **Adjustment for Multiple Hypothesis Testing.**
# When all gene sets have been evaluated, DOSE adjusts the estimated significance levels to account for multiple hypothesis testing; q-values are also calculated for FDR control.
myGSEA.res <- GSEA(mydata.gsea, TERM2GENE=ensComp, verbose=FALSE)
## Warning in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, : There are ties in the preranked stats (0.02% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
myGSEA.df <- as_tibble(myGSEA.res@result)

myGSEA.tbl<-as_tibble(myGSEA.res@result) %>%
  dplyr::select(-c(Description, pvalue, enrichmentScore)) %>%
  dplyr::rename(normalized_EnrichmentScore = NES)

# view results as an interactive table
enrichment.DT <- datatable(myGSEA.tbl, 
                           rownames = TRUE,
                           caption =  htmltools::tags$caption(
                             style = 'caption-side: top; text-align: left; color: black',
                             htmltools::tags$b('Gene Families Enriched in ', 
                                               gsub('-',' vs ',
                                                    names(list.myTopHits.df)[[yy]]))
                           ),
                           options = list(
                             autoWidth = TRUE,
                             scrollX = TRUE,
                             #scrollY = '800px',
                             scrollCollapse = TRUE,
                             searchHighlight = TRUE, 
                             order = list(3, 'desc'),
                             pageLength = 25, 
                             lengthMenu = c("5",
                                            "10",
                                            "25",
                                            "50",
                                            "100"),
                             columnDefs = list(
                               list(targets = "_all",
                                    class="dt-right")))) %>%
  formatRound(columns=c(3,5:6), digits=2) %>%
  formatRound(columns=c(4), digits=4)
enrichment.DT
# create enrichment plots using the enrichplot package
# gseaplot2(myGSEA.res, 
#           geneSetID = 3, #can choose multiple signatures to overlay in this plot
#           pvalue_table = FALSE, #can set this to FALSE for a cleaner plot
#           title = "SCP/TAP Gene Set") #can also turn off this title

# add a variable to this result that matches enrichment direction with phenotype
myGSEA.df <- myGSEA.df %>%
  mutate(life_stage = case_when(
    NES > 0 ~ str_split(names(list.myTopHits.df)[[yy]],'-',simplify = T)[1,1],
    NES < 0 ~ str_split(names(list.myTopHits.df)[[yy]],'-',simplify = T)[1,2]))

myGSEA.df$ID <- myGSEA.df$ID %>%
  word(sep = ',') %>%
  #word(sep = '/') %>%
  word(sep = ' and')

# create 'bubble plot' to summarize y signatures across x phenotypes
ggplot(myGSEA.df, aes(x=life_stage, y=ID)) + 
  geom_point(aes(size=setSize, color = NES, alpha=-log10(p.adjust))) +
  scale_color_gradient(low="blue", high="red") +
  labs(title = paste0('S. venezuelensis: Gene Families Enriched in ', 
                      gsub('-',' vs ',
                           names(list.myTopHits.df)[[yy]])),
       subtitle = 'NES = Normalized Enrichment Score; Gene family assignments 
             from Ensembl Compara dataset defined in Hunt et al 2016',
       x = "Life Stage",
       y = "Family ID") +
  #coord_fixed(1/2) +
  theme_bw() +
  theme(plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title = element_text(face = "bold",
                                  size = 13, hjust = 0),
        axis.title = element_text(face = "bold",size = 10.4),
        legend.title = element_text(face="bold",size = 10.4),
        aspect.ratio = 3/1)

Appendix I: Analysis of Full S. venezuelensis Dataset

As stated in the Introduction, samples included in the S. venezuelensis database can be divided into two distinct batches that were prepared using different library construction methods (amplified vs non-amplified), sequencing run batches, and machines (Hunt et al 2018). The code below runs hierarchical clustering and PCA on all samples, both before and after limma-based batch correction. In both cases, there appear to be substantial differences between the FLF groups from the two batches; thus the primary analyses above are restricted to a single group and do not attempt batch correction.

Hierarchical Clustering and Principal Components Analysis on Non-Batch Corrected Data

This code chunk starts with filtered and normalized abundance data in a data frame (not tidy). It implements hierarchical clustering and PCA on the data and plots several graphs, including a dendrogram of the hierarchical clustering and several plots to visualize the PCA. Because the data passed into these analyses have not been batch corrected, the clustering appears to be dominated by a batch effect.

Dendrogram to visualize hierarchical clustering

Clustering performed on filtered and normalized abundance data using the “complete” method.
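
A minimal sketch of this clustering step on simulated data follows. The matrix `toy_expr`, its sample names, and the shift applied to the second group are invented for illustration; the "maximum" distance and "complete" linkage settings mirror those used in the report's code in Appendix II.

```r
# Toy expression matrix: 50 genes (rows) x 6 samples (columns); values are simulated
set.seed(42)
toy_expr <- matrix(rnorm(300), nrow = 50,
                   dimnames = list(paste0("gene", 1:50),
                                   paste0(rep(c("A", "B"), each = 3), ".", 1:3)))
# Shift group B upward so that the two groups separate clearly
toy_expr[, 4:6] <- toy_expr[, 4:6] + 10

# dist() computes distances between rows, so transpose to compare samples
toy_dist <- dist(t(toy_expr), method = "maximum")

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members
toy_clusters <- hclust(toy_dist, method = "complete")

# Cut the tree into two groups and inspect the assignment
toy_groups <- cutree(toy_clusters, k = 2)
print(toy_groups)
```

Passing `as.dendrogram(toy_clusters)` to `plot()` would draw the corresponding dendrogram.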

PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.
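
The quantities behind such a plot can be sketched with base R's `prcomp()`. The simulated matrix `toy_expr` and the `groupA`/`groupB` labels are hypothetical stand-ins for the real abundance data and life stages.

```r
# Simulated gene x sample matrix; samples 4-6 differ from 1-3 in the first 20 genes
set.seed(7)
toy_expr <- matrix(rnorm(600), nrow = 100,
                   dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))
toy_expr[1:20, 4:6] <- toy_expr[1:20, 4:6] + 4

# prcomp() expects samples in rows, so transpose the gene x sample matrix
toy_pca <- prcomp(t(toy_expr), scale. = FALSE, retx = TRUE)

# Eigenvalues are the squared standard deviations of the principal components
pc_var <- toy_pca$sdev^2
pc_per <- round(pc_var / sum(pc_var) * 100, 1)  # percent variance per PC

# Sample coordinates on the first two PCs, labelled by a hypothetical group
toy_coords <- data.frame(toy_pca$x[, 1:2],
                         group = rep(c("groupA", "groupB"), each = 3))
print(pc_per)
print(toy_coords)
```

Plotting `PC1` against `PC2` from `toy_coords`, with fill mapped to `group`, gives the same kind of figure described above.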

Hierarchical Clustering and PCA on Batch Corrected Data

Clustering performed on batch-corrected, voom-transformed, filtered and normalized abundance data in a data frame (not tidy). These plots should reveal whether the batch correction is effective. Pay close attention to the FLF samples - I’m not convinced their differences have been fully corrected.

Dendrogram to visualize hierarchical clustering

Clustering performed on batch-corrected, filtered and normalized abundance data using the “complete” method.

PCA Plot

Appendix II: All code for this report

# Load and Parse Preprocessed Data
load (file = "../Outputs/SvRNAseq_group_FLF_PF_data_preprocessed")
targets <- SvRNAseq.preprocessed.data$targets
annotations <- SvRNAseq.preprocessed.data$annotations
log2.cpm.filtered.norm <- SvRNAseq.preprocessed.data$log2.cpm.filtered.norm
myDGEList.filtered.norm <-SvRNAseq.preprocessed.data$myDGEList.filtered.norm

rm(SvRNAseq.preprocessed.data)

load(file = "../Outputs/Sv_vDGEList")

# Check for presence of output plots folder, generate if it doesn't exist
output.path <- "../Outputs/Plots"
if (!dir.exists(output.path)){
  dir.create(output.path)
}


# Introduction to this chunk -----------
# This code chunk starts with filtered and normalized abundance data in a data frame (not tidy).
# It will implement hierarchical clustering and PCA analyses on the data.
# It will plot various graphs and can save them in PDF files.
# Load packages ------
suppressPackageStartupMessages({
  library(tidyverse) # data wrangling and plotting
  library(ggplot2)
  library(RColorBrewer)
  library(ggdendro)
  library(magrittr)
  library(factoextra)
  library(gridExtra)
  library(cowplot)
  library(dendextend)
})

# Identify variables of interest in study design file ----
group <- factor(targets$group)
batch <- factor(targets$batch)
source <- factor(targets$source)

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# i.e. how similar the samples are to each other
colnames(log2.cpm.filtered.norm)<-paste(targets$group,substr(targets$source, 10,12), sep = ".")
distance <- dist(t(log2.cpm.filtered.norm), method = "maximum") #other distance methods are "euclidean", "manhattan", "canberra", "binary" or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 

p1<-dend %>% 
  dendextend::set("branches_k_color", k = 6) %>% 
  dendextend::set("hang_leaves", c(0.05)) %>% 
  dendextend::set("labels_cex", c(0.5)) %>%
  dendextend::set("labels_colors", k = 6) %>% 
  dendextend::set("branches_lwd", c(0.7)) %>% 
  
  as.ggdend %>%
  ggplot (offset_labels = -0.2) +
  theme_dendro() +
  ylim(0, max(get_branches_heights(dend))) +
  labs(title = "S. venezuelensis: Hierarchical Cluster Dendrogram",
       subtitle = "filtered, TMM normalized, group FLF/PF",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))
p1
# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(log2.cpm.filtered.norm), scale.=F, retx=T)
#summary(pca.res) # Prints variance summary for all principal components.

#pca.res$rotation # $rotation holds the variable loadings: how much each gene contributes to each PC (called 'scores' in this script)
#pca.res$x # $x holds the sample coordinates on each PC (standard term: scores; called 'loadings' in this script)
#note that these have a magnitude and a direction (this is the basis for making a PCA plot)
## This generates a scree plot: a standard way to view the eigenvalues for each PC. Shows the proportion of variance accounted for by each PC. Plotting only the first 10 dimensions.
p2<-fviz_eig(pca.res,
             barcolor = brewer.pal(8,"Pastel2")[8],
             barfill = brewer.pal(8,"Pastel2")[8],
             linecolor = "black",
             main = "Scree plot: proportion of variance accounted for by each principal component",
             ggtheme = theme_bw()) 
p2
pc.var<-pca.res$sdev^2 # sdev^2 captures these eigenvalues from the PCA result
pc.per<-round(pc.var/sum(pc.var)*100, 1) # we can then use these eigenvalues to calculate the percentage variance explained by each PC

# Visualize the PCA result ------------------
#lets first plot any two PCs against each other
#pca.res$x gives the coordinates of each sample on each PC (called 'loadings' in this script), so let's plot
pca.res.df <- as_tibble(pca.res$x)

# Plotting PC1 and PC2
p3<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2, label=targets$group, 
      fill = targets$group,
      color = targets$group
  ) +
  geom_point(size=4, shape= 21, color = "black", alpha = 0.5) +
  #geom_label(color = "black", size = 2) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2", guide = "none") +
  #stat_ellipse() +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="S. venezuelensis: Principal Components Analysis of RNAseq Samples",
       caption = "Note: analysis is blind to life stage identity.",
       subtitle ="RNAseq Dataset: Group FLF_PF",
       fill = "Life Stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw()+
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

suppressMessages(ggsave("Sv_Multivariate_Plots_PCA.pdf",
       plot = p3,
       device = "pdf",
       height = 4,
       #width = 7,
       path = output.path))
p3
# Create a PCA 'small multiples' chart ----
pca.res.df <- pca.res$x[,1:3] %>% 
  as_tibble() %>%
  add_column(sample = targets$sample,
             source = source,
             group = group,
             batch = factor(targets$batch))

pca.pivot <- pivot_longer(pca.res.df, # dataframe to be pivoted
                          cols = PC1:PC3, # column names to be stored as a SINGLE variable
                          names_to = "PC", # name of that new variable (column)
                          values_to = "loadings") # name of new variable (column) storing all the values (data)
PC1<-subset(pca.pivot, PC == "PC1")
PC2 <-subset(pca.pivot, PC == "PC2")
#PC3 <- subset(pca.pivot, PC == "PC3")
#PC4 <- subset(pca.pivot, PC == "PC4")

# New facet label names for PCs
PC.labs <- c(paste0("PC1 (",pc.per[1],"%",")"),
             paste0("PC2 (",pc.per[2],"%",")"),
             paste0("PC3 (",pc.per[3],"%",")")
             )
names(PC.labs) <- c("PC1", "PC2", "PC3")

p6<-ggplot(pca.pivot) +
  aes(x=sample, y=loadings) + # you could iteratively 'paint' different covariates onto this plot using the 'fill' aes
  geom_bar(stat="identity", aes(fill = group)) +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~PC, labeller = labeller(PC = PC.labs)) +
  #geom_bar(data = PC1, stat = "identity", aes(fill = group)) +
  #geom_bar(data = PC2, stat = "identity", aes(fill = source)) +
  labs(title="S. venezuelensis: PCA 'small multiples' plot",
       fill = "Life Stage",
       subtitle ="RNAseq Dataset: Group FLF_PF") +
  scale_x_discrete(limits = targets$sample, labels = targets$source) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10)) +
  coord_flip()

suppressMessages(ggsave("Sv_Multivariate_Plots_Small_Multiples.pdf",
       plot = p6,
       device = "pdf",
       height = 4,
       width = 8,
       path = output.path))
p6
# Introduction to this chunk ----
# This chunk provides additional analysis of the principal components, in order to determine which genes are influencing the identified PCs.

# Use pca.res$rotation to select genes influencing PC1-6 ----
myscores.df <- pca.res$rotation[,1:6] %>% 
  as_tibble(rownames = "geneID") %>%
  pivot_longer(cols = -geneID, names_to = "PC", values_to = "scores") %>%
  dplyr::mutate(abs_scores = abs(scores)) %>%
  group_by(PC) %>%
  slice_max(abs_scores, prop = .1) # get top 10% of genes in all PCs

# Pull out genes that are the top 10% of contributors (in any direction) to PC1 and PC2. Annotate.
myscores.Top10 <- myscores.df %>%
  dplyr::filter(PC == "PC1" | PC == "PC2") %>%
  dplyr::select(!abs_scores) %>%
  dplyr::arrange(desc(scores), .by_group = T) %>%
  dplyr::left_join(.,(rownames_to_column(annotations, var = "geneID")), by = "geneID") %>%
  dplyr::relocate(UniProtKB, Description, InterPro, GO_term, Ce_geneID, Ce_percent_homology, .after = scores)


# Make interactive table
myscores.Top10.interactive <- myscores.Top10 %>%
  DT::datatable(extensions = c('KeyTable', "FixedHeader", "Buttons", "RowGroup"),
                rownames = FALSE,
                caption = htmltools::tags$caption(
                  style = 'caption-side: top; text-align: left;',
                  htmltools::tags$b('Top 10% of Genes Contributing to S. venezuelensis PC1 and PC2')),
                options = list(keys = TRUE,
                               dom = 'Bfrtip',
                               rowGroup = list(dataSrc = 1),
                               buttons = c('csv', 'excel'),
                               autoWidth = TRUE,
                               scrollX = TRUE,
                               scrollY = '300px',
                               searchHighlight = TRUE, 
                               pageLength = 10, 
                               lengthMenu = c("10", "25", "50", "100"))) %>%
  DT::formatRound(columns=c(3), digits=3)

myscores.Top10.interactive
suppressPackageStartupMessages({
library(pheatmap)
library(RColorBrewer)
library(heatmaply)
})
# Make a heatmap for all the genes using the Log2CPM values

diffGenes <- v.DEGList.filtered.norm$E %>%
  as_tibble(rownames = "geneID", .name_repair = "unique") %>%
  dplyr::select(!geneID) %>%
  as.matrix()
rownames(diffGenes) <- rownames(v.DEGList.filtered.norm$E)
colnames(diffGenes) <- as.character(v.DEGList.filtered.norm$targets$source)
clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete")
clustRows <- hclust(as.dist(1-cor(t(diffGenes),
                                  method="pearson")),
                    method="complete")
par(cex.main=1.2)

showticklabels <- c(TRUE,FALSE)
p<-pheatmap(diffGenes,
            color = RdBu(75),
            cluster_rows = clustRows,
            cluster_cols = clustColumns,
            show_rownames = F,
            scale = "row",
            angle_col = 45,
            main = "Sv: Log2 Counts Per Million (CPM) Expression Across Life Stages (Group FLF_PF)"

)

# Introduction to this chunk ----
# Because we have access to biological and technical replicates, we can use statistical tools for differential expression analysis
# Useful reading on differential expression: https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html

# Load packages ----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma) # differential gene expression using linear modeling
  library(edgeR)
  library(gt) 
  library(DT) 
  library(plotly)
  library(ggthemes)
  library(RColorBrewer)
  source("./theme_Publication.R")
})

diffGenes.df <- v.DEGList.filtered.norm$E %>%
  as_tibble(rownames = "geneID", .name_repair = "unique")

# Set Expression threshold values for plotting and saving DEGs ----
adj.P.thresh <- 0.05
lfc.thresh <- 1 

group <- factor(v.DEGList.filtered.norm$targets$group)
block <- factor (targets$batch)
design <- model.matrix(~0 + group) # no intercept/blocking for matrix, comparisons across group
colnames(design) <- levels(group)


# Fit a linear model to the data ----
fit <- lmFit(v.DEGList.filtered.norm, design = design)

# As an example, generate comparison matrix for a pairwise comparison ----
# PF vs FLF
# Note that the target/contrast groups will each be divided by the number of life
# stages they contain, e.g. (PF+FLF)/2 - (iL3+iL3a+pfL1+ppL1+ppL3)/5
comparison <- c('(PF)-(FLF)')

targetStage<- comparison %>%
  str_split(pattern="-", simplify = T) %>%
  .[,1] %>%
  gsub("(", "", ., fixed = TRUE) %>%
  gsub(")", "", ., fixed = TRUE) %>%
  str_split(pattern = "\\+", simplify = T)

contrastStage<-comparison %>%
  str_split(pattern="-", simplify = T) %>%
  .[,2] %>%
  gsub("(", "", ., fixed = TRUE) %>%
  gsub(")", "", ., fixed = TRUE)  %>%
  str_split(pattern = "\\+", simplify = T)

comparison<- sapply(seq_along(comparison),function(x){
  tS <- as.vector(targetStage[x,]) %>%
    .[. != ""] 
  cS <- as.vector(contrastStage[x,]) %>%
    .[. != ""] 
  paste(paste0(tS, 
               collapse = "+") %>%
          paste0("(",.,")/",length(tS)),
        paste0(cS, 
               collapse = "+") %>%
          paste0("(",.,")/",length(cS)),
        sep = "-")
  
})

# Generate contrast matrix ----
contrast.matrix <- makeContrasts(contrasts = comparison,
                                 levels=design)

# extract the linear model fit -----
fits <- contrasts.fit(fit, contrast.matrix)
# empirical bayes smoothing of gene-wise standard deviations provides increased power (see: https://www.degruyter.com/doi/10.2202/1544-6115.1027)
ebFit <- eBayes(fits)

# Pull out the DEGs that pass a specific threshold for all pairwise comparisons ----
# Adjust for multiple comparisons using method = global. 
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value = adj.P.thresh)

recode01<- function(x){
  case_when(x == 1 ~ "Up",
            x == -1 ~ "Down",
            x == 0 ~ "NotSig")
}
diffDesc <- results %>%
  as_tibble(rownames = "geneID") %>%
  dplyr::mutate(across(-geneID, unclass)) %>%
  dplyr::mutate(across(where(is.double), recode01))

# Function that identifies top DEGs between a specific contrast ----
calc_DEG_tbl <- function (ebFit, coef) {
  myTopHits.df <- limma::topTable(ebFit, adjust ="BH", 
                                  coef=coef, number=40000, 
                                  sort.by="logFC") %>%
    as_tibble(rownames = "geneID") %>%
    dplyr::rename(tStatistic = t, LogOdds = B, BH.adj.P.Val = adj.P.Val) %>%
    dplyr::relocate(UniProtKB, Description, InterPro, GO_term, 
                    In.subclade_geneID, In.subclade_percent_homology,
                    Out.subclade_geneID, Out.subclade_percent_homology,
                    Ce_geneID, Ce_percent_homology, .after = LogOdds)
  
  myTopHits.df
}

list.myTopHits.df <- sapply(comparison, function(y){
  calc_DEG_tbl(ebFit, y)}, 
  simplify = FALSE, 
  USE.NAMES = TRUE)

list.myTopHits.df <- sapply(comparison, function(y){
  list.myTopHits.df[[y]] %>%
    dplyr::select(geneID, 
                  logFC, 
                  BH.adj.P.Val:Ce_percent_homology)},
  simplify = FALSE, 
  USE.NAMES = TRUE)

# Get log2CPM values and threshold information for genes of interest
list.myTopHits.df <- sapply(seq_along(comparison), function(y){
  tS<- targetStage[y,][targetStage[y,]!=""]
  cS<- contrastStage[y,][contrastStage[y,]!=""]
  
  concat_name <- function(x) {
    ifelse(x == "target", 
           paste(tS, collapse = "+"), 
           paste(cS, collapse = "+"))
  }
  
  groupAvgs <- diffGenes.df %>%
    dplyr::select(geneID, starts_with(paste0(tS,"-")), 
                  starts_with(paste0(cS,"-"))) %>%
    pivot_longer(cols = -geneID, names_to = c("group","sample"), values_to = "CPM",
                 names_sep = "-") %>%
    dplyr::mutate(contrastID = if_else(group %in% tS,"target", "contrast")) %>%
    group_by(geneID, contrastID) %>%
    dplyr::select(-sample) %>%
    summarize(mean = mean(CPM), .groups = "drop_last") %>%
    pivot_wider(names_from = contrastID, values_from = mean) %>%
    dplyr::relocate(contrast, .after = target) %>%
    dplyr::rename_with(concat_name, -geneID) %>%
    dplyr::rename_with(.cols =-geneID, .fn = ~ paste0("avg_(",.x,")"))
  
  diffGenes.df %>%
    dplyr::select(geneID, starts_with(paste0(tS,"-")), 
                  starts_with(paste0(cS,"-"))) %>%
    left_join(groupAvgs, by = "geneID") %>%
    left_join(list.myTopHits.df[[y]],., by = "geneID") %>%
    left_join(dplyr::select(diffDesc,geneID,comparison[y]), by = "geneID") %>%
    dplyr::rename(DEG_Desc=comparison[y]) %>%
    dplyr::relocate(DEG_Desc) %>%
    dplyr::relocate(logFC:Ce_percent_homology, .after = last_col())
  
},
simplify = FALSE)

comparison <- gsub("/[0-9]*","", comparison)
names(list.myTopHits.df) <- comparison

list.myTopHits.df <- sapply(comparison, function(y){
  list.myTopHits.df[[y]] %>%
    dplyr::mutate(DEG_Desc = case_when(DEG_Desc == "Up" ~ paste0("Up in ", str_split(y,'-',simplify = T)[1,1]),
                                       DEG_Desc == "Down" ~ paste0("Down in ", str_split(y,'-',simplify = T)[1,1]),
                                       DEG_Desc == "NotSig" ~ "NotSig")) 
},
simplify = FALSE, 
USE.NAMES = TRUE)

# PC1 Volcano Plot and Interactive Table ----
vplot1 <- ggplot(list.myTopHits.df[[1]]) +
  aes(y=-log10(BH.adj.P.Val), x=logFC, text = paste(geneID, "<br>",
                                                    "logFC:", round(logFC, digits = 2), "<br>",
                                                    "p-val:", format(BH.adj.P.Val, digits = 3, scientific = TRUE))) +
  geom_point(size=2) +
  geom_hline(yintercept = -log10(adj.P.thresh), 
             linetype="longdash", 
             colour="grey", 
             size=1) + 
  geom_vline(xintercept = lfc.thresh, 
             linetype="longdash", 
             colour="#BE684D", 
             size=1) +
  geom_vline(xintercept = -lfc.thresh, 
             linetype="longdash", 
             colour="#2C467A", 
             size=1) +
  labs(title = paste0('S. venezuelensis: Pairwise Comparison: ',
                      gsub('-',
                           ' vs ',
                           comparison[1])),
       subtitle = paste0("grey line: p = ",
                         adj.P.thresh, "; colored lines: log-fold change = ", lfc.thresh),
       color = "GeneIDs") +
  theme_Publication() 
vplot1

# Interactive Tables
yy<- 1
tS<- targetStage[yy,][targetStage[yy,]!=""]
cS<- contrastStage[yy,][contrastStage[yy,]!=""]
sample.num.tS <- sapply(tS, function(x) {colSums(v.DEGList.filtered.norm$design)[[x]]}) %>% sum()
sample.num.cS <- sapply(cS, function(x) {colSums(v.DEGList.filtered.norm$design)[[x]]}) %>% sum()


n_num_cols <- sample.num.tS + sample.num.cS + 5
index_homologs <- length(colnames(list.myTopHits.df[[yy]])) - 5

LS.datatable <- list.myTopHits.df[[yy]] %>%
  DT::datatable(rownames = FALSE,
                caption = htmltools::tags$caption(
                  style = 'caption-side: top; text-align: left; color: black',
                  htmltools::tags$b('Differentially Expressed Genes in', 
                                    htmltools::tags$em('S. venezuelensis'), 
                                    gsub('-',' vs ',comparison[yy])),
                  htmltools::tags$br(),
                  "Threshold: p < ",
                  adj.P.thresh, "; log-fold change > ",
                  lfc.thresh,
                  htmltools::tags$br(),
                  'Values = log2 counts per million'),
                options = list(autoWidth = TRUE,
                               scrollX = TRUE,
                               scrollY = '300px',
                               scrollCollapse = TRUE,
                               order = list(n_num_cols-1, 
                                            'desc'),
                               searchHighlight = TRUE, 
                               pageLength = 25, 
                               lengthMenu = c("5",
                                              "10",
                                              "25",
                                              "50",
                                              "100"),
                               columnDefs = list(
                                 # list(
                                 #   targets = ((n_num_cols+1)),
                                 #   render = JS(
                                 #     "function(data, row) {",
                                 #     "data.toExponential(1);",
                                 #     "}")
                                 # ),
                                 list(
                                   targets = ((n_num_cols + 
                                                 4):(n_num_cols + 
                                                       5)),
                                   render = JS(
                                     "function(data, type, row, meta) {",
                                     "return type === 'display' && data.length > 20 ?",
                                     "'<span title=\"' + data + '\">' + data.substr(0, 20) + '...</span>' : data;",
                                     "}")
                                 ),
                                 list(targets = "_all",
                                      class="dt-right")
                               ),
                               rowCallback = JS(c(
                                 "function(row, data){",
                                 "  for(var i=0; i<data.length; i++){",
                                 "    if(data[i] === null){",
                                 "      $('td:eq('+i+')', row).html('NA')",
                                 "        .css({'color': 'rgb(151,151,151)', 'font-style': 'italic'});",
                                 "    }",
                                 "  }",
                                 "}"  
                               ))
                               
                )) 
LS.datatable <- LS.datatable %>%
  DT::formatRound(columns=c(3:n_num_cols), 
                  digits=3)

LS.datatable <- LS.datatable %>%
  DT::formatRound(columns=c(n_num_cols+2, 
                            index_homologs+1,
                            index_homologs+3), 
                  digits=2)

LS.datatable <- LS.datatable %>%
  DT::formatSignif(columns=c(n_num_cols+1), 
                   digits=3)

LS.datatable


suppressPackageStartupMessages({
  library(openxlsx)
  library(tidyverse)
  library(ggplot2)
})
# Load Hunt et al 2018 dataset: PF vs FLF comparison
temp.dat <-  read.xlsx ("../Data/Benchmarking/41598_2018_23514_MOESM2_ESM.xlsx", 
                        sheet = 1, startRow = 4)

Hunt.dat <- tibble(geneID = temp.dat$QUERY_GENE, logFC = temp.dat$logFC)
Hunt.dat <- Hunt.dat[complete.cases(Hunt.dat),]
rm(temp.dat)

# Extract the results of the PF vs FLF comparison from the Browser analysis
Browser.dat <- list.myTopHits.df$`(PF)-(FLF)` %>%
  dplyr::select(geneID, logFC)

print(paste('Total number of genes in Hunt *et al* 2018 PF vs FLF comparison tab:',nrow(Hunt.dat)))
print(paste('Total number of genes in Str-RNAseq Browser PF vs FLF output file:', nrow(Browser.dat))) 


# The plot below takes the genes with LogFC results in both the Browser and Hunt databases, and plots the two sets against each other. 
plotting.all <- inner_join(Browser.dat, Hunt.dat, by = "geneID")

linearMod <- lm(logFC.y ~ logFC.x, data = plotting.all) %>%
  summary()

p.benchmark <- ggplot(plotting.all, aes(x = logFC.x, y = logFC.y)) +
  geom_smooth(method=lm, color = 'red', formula = "y ~ x") +
  geom_point(shape=16, size=3, alpha = 0.8) +
  labs(title = "S. venezuelensis: Str-Browser vs Hunt Data",
       subtitle = "Group: FLF_PF; comparison = PF vs FLF",
       caption = paste("points = genes; red line/shading = linear regression \n",
                       "w/ 95% confidence regions (formula = y ~ x). \n",
                       "Adj R-squared =",
                       round(linearMod$adj.r.squared,3)),
       x = "Str-Browser LogFC",
       y = "Hunt et al 2018 LogFC") +
  coord_equal() +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

print("Linear regression of Browser vs Hunt LogFC results:")
(linearMod)

suppressMessages(ggsave("Sv_Benchmarking.pdf",
       plot = p.benchmark,
       device = "pdf",
       height = 4,
       #width = 8,
       path = output.path))

p.benchmark 
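
# Illustrative sketch with made-up numbers (not real results): the benchmarking above
# keeps only genes present in both tables (inner join) and then regresses one set of
# logFC values on the other. example_benchmark_fit() is a hypothetical base-R helper
# that mirrors those two steps.
example_benchmark_fit <- function(browser, hunt) {
  both <- merge(browser, hunt, by = "geneID", suffixes = c(".x", ".y"))  # inner join
  fit <- summary(lm(logFC.y ~ logFC.x, data = both))
  list(n.shared = nrow(both), adj.r.squared = fit$adj.r.squared)
}
# Usage with toy data:
# example_benchmark_fit(
#   data.frame(geneID = paste0("g", 1:5), logFC = c(1, 2, 3, 4, 5)),
#   data.frame(geneID = paste0("g", 2:6), logFC = c(2.1, 2.9, 4.2, 4.8, 6)))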

# Introduction to this chunk ----
# this chunk creates heatmaps from differentially expressed genes;
# it takes as input the genes that are differentially expressed in the selected pairwise comparison.
# It selects modules of co-expressed genes based on Pearson correlations
# 
# These data/results are examples of possible analyses that can be run on this data.

# Load packages -----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma)
  library(RColorBrewer)
  library(gplots)
  library(heatmaply)
  library(ggplot2)
  library(egg)
  library(dendextend)
  source("./ggheatmap_local.R")
})

# Choose a color palette ----
#myheatcolors <- rev(brewer.pal(name="RdBu", n=11))
myheatcolors <- RdBu(75)

# Select the comparison
y = 1

# Generate variable containing expression data for the thresholded DEGs 
diffGenes.thresh <- v.DEGList.filtered.norm$E[results[,y] !=0,]


# Cluster DEGs across stages ----
#begin by clustering the genes (rows) for a list of genes that are differentially expressed in at least one life stage
# use the 'cor' function and the pearson method for finding all pairwise correlations of genes
# '1-cor' converts this to a 0-2 scale for each of these correlations, which can then be used to calculate a distance matrix using 'as.dist'
clustRows <- hclust(as.dist(1-cor(t(diffGenes.thresh), method="pearson")), method="complete") 
# hierarchical clustering is a type of unsupervised clustering. 
# NOTE: this clustering may give different results from one based on log2.cpm.filtered.norm data, likely b/c this version is specifically focused on genes that are significantly different between conditions.
# Related methods include K-means, SOM, etc 
# unsupervised methods are blind to sample/group identity
# in contrast, supervised methods 'train' on a set of labeled data.  
# supervised clustering methods include random forest, and artificial neural networks

# cluster samples (columns)
clustColumns <- hclust(as.dist(1-cor(diffGenes.thresh, method="spearman")), method="complete") #cluster columns by spearman correlation
#note: use Spearman, instead of Pearson, for clustering samples because it gives equal weight to highly vs lowly expressed transcripts or genes

#Cut the resulting tree and create color vector for clusters.  
module.assign <- stats::cutree(clustRows, k=8) # k = 8 modules is a user-chosen cut; the DEGs here come from the pairwise comparison selected above (y)

# assign a color to each module (makes it easy to identify and manipulate)
module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9) 
module.color <- module.color[as.vector(module.assign)] 
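
# Illustrative sketch on a toy matrix (genes in rows, samples in columns): genes are
# grouped into modules by turning Pearson correlations into distances (1 - cor),
# clustering with complete linkage, and cutting the tree into k groups.
# example_cluster_modules() is a hypothetical helper that chains those steps.
example_cluster_modules <- function(expr, k) {
  gene.dist <- as.dist(1 - cor(t(expr), method = "pearson"))  # 0 = same profile, 2 = opposite
  cutree(hclust(gene.dist, method = "complete"), k = k)
}
# Usage: two rising genes and two falling genes split into two modules
# example_cluster_modules(rbind(g1 = c(1, 2, 3, 4), g2 = c(2, 4, 6, 8),
#                               g3 = c(4, 3, 2, 1), g4 = c(8, 6, 4, 2)), k = 2)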

# # simplify heatmap by averaging the biological replicates and display only one column per condition
# diffGenes.AVG <- avearrays(diffGenes.thresh)

# plot the hclust results as a heatmap, grouping the life stages together
diffGenes.heatmap <- heatmap.2(diffGenes.thresh,
                               srtCol = 0, adjCol= c(0.5,0.5),
                               Rowv=as.dendrogram(clustRows),
                               Colv=as.dendrogram(clustColumns),
                               key.title = NA,
                               main = paste0("DEG Heatmap (by life stage): "),
                               sub = paste0("Genes pass threshold in >= 1 comparison. Threshold: p < ",
                                            adj.P.thresh, "; log-fold change > ",
                                            lfc.thresh),
                               RowSideColors=module.color,
                               col=rev(myheatcolors), scale='row', labRow=NA,
                               density.info="none", trace="none",
                               cexRow=1, cexCol=1)

## GGPlots version
# gg.diffGenes.heatmap<-ggheatmap_local(diffGenes.thresh,
#                    colors = rev(myheatcolors),
#                    Rowv= ladderize(as.dendrogram(clustRows)),
#                    Colv=ladderize(as.dendrogram(clustColumns)),
#                    key.title = "Log2CPM",
#                    branches_lwd = 0.2,
#                    showticklabels = c(TRUE, FALSE),
#                    scale='row',
#                    cexRow=1, cexCol=1)

# ggsave("./heatmap.pdf", plot = gg.diffGenes.heatmap, width = 11, height = 8, units = "in", device = "pdf")
# Make an interactive version
# interactive.diffGenes.heatmap <- heatmaply(diffGenes.thresh,
#                                  colors = rev(myheatcolors),
#                                  Rowv= ladderize(as.dendrogram(clustRows)),
#                                  Colv=ladderize(as.dendrogram(clustColumns)),
#                                  showticklabels = c(TRUE, FALSE),
#                                  scale='row',
#                                  plot_method = "ggplot",
#                                  branches_lwd = 0.2,
#                                  key.title = "Log2CPM",
#                                  cexRow=1, cexCol=1)

# Load packages ----
suppressPackageStartupMessages({
  library(tidyverse)
  library(limma)
  library(openxlsx)
  library(gplots) #for heatmaps
  library(DT) #interactive and searchable tables of our GSEA results
  library(GSEABase) #functions and methods for Gene Set Enrichment Analysis
  library(Biobase) #base functions for bioconductor; required by GSEABase
  library(GSVA) #Gene Set Variation Analysis, a non-parametric and unsupervised method for estimating variation of gene set enrichment across samples.
  library(gprofiler2) #tools for accessing the GO enrichment results using g:Profiler web resources
  library(clusterProfiler) # provides a suite of tools for functional enrichment analysis
  library(msigdbr) # access to msigdb collections directly within R
  library(enrichplot) # great for making the standard GSEA enrichment plots
})
# Pick a pairwise comparison
yy <- 1

# Carry out GO enrichment using gProfiler2 ----
# GO enrichment requires a pre-selected set of genes. Can use multiple criteria to do that initial selection.
# The GO terms accessed via gost() are from Hunt et al 2016, I believe.

# # PC1 TopTable Results
# enriched.set.pos <-list.myTopHits.df[[yy]] %>% 
#     slice_max(logFC, prop = .1) # get top 10% of genes
# 
# enriched.set.neg <- list.myTopHits.df[[yy]] %>% 
#     slice_min(logFC, prop = .1) # get top 10% of genes
# 
# gost.res.pos <- gost(list(Target_Upregulated = enriched.set.pos$geneID), organism = "ststerprjeb528", correction_method = "fdr")
# gostplot(gost.res.pos, interactive = T, capped = T)
# 
# gost.res.neg <- gost(list(Target_Downregulated_Genes = enriched.set.neg$geneID), organism = "ststerprjeb528", correction_method = "fdr")
# gostplot(gost.res.neg, interactive = T, capped = T)

# Perform GSEA using clusterProfiler ----
# Which library to use for implementation? As per https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbz158/5722384: "For expression-based EA on the full expression matrix...When given raw read counts, we recommend to apply a VST such as voom [39] to arrive at library-size normalized logCPMs."
# For testing self-contained null hypothesis (test for association of any gene in the set with the phenotype), use ROAST
# For testing competitive null hypothesis (test for excess of differential expression in a gene set relative to genes outside the set) - **their recommendation**, use PADOG or SAFE?
# 
# Ability to do this depends on the availability of gene sets. Major databases (e.g. msigdb) don't seem to have Strongyloides information. They do have C. elegans gene sets, but I'm not convinced the homology information is good enough for the comparison to be unbiased/meaningful.
# 

# In Hunt et al 2016, there is an Ensembl Compara protein family set
# Note that this uses specific transcript information, which I throw out. 
# (e.g. SSTP_0001137400.2 is recoded as SSTP_0001137400)
ensComp.geneIDs <- read.xlsx ("../Data/Hunt_Parasite_Ensembl_Compara.xlsx", 
                              sheet = 1) %>%
  as_tibble() %>%
  dplyr::select(-Family.members) %>%
  pivot_longer(cols = -Compara.family.id, values_to = "geneID") %>%
  dplyr::select(-name) %>%
  dplyr::filter(grepl("SVE_", geneID))

ensComp.geneIDs$geneID <- str_remove_all(ensComp.geneIDs$geneID, "\\.[0-9]$")
ensComp.geneIDs$geneID <- str_remove_all(ensComp.geneIDs$geneID, "[a-z]$")

# Compare these genes to the list of genes in our filtered, normalized list ----
# 
compara.exclusive <- unique(ensComp.geneIDs$geneID) %>%
  as_tibble_col(column_name = "geneID") %>%
  dplyr::anti_join(diffGenes.df, by = "geneID")
paste('Number of genes exclusive to the Ensembl Compara List: ',nrow(compara.exclusive))

compara.absent <- unique(ensComp.geneIDs$geneID) %>%
  as_tibble_col(column_name = "geneID") %>%
  dplyr::anti_join(diffGenes.df,., by = "geneID") %>%
  dplyr::select(geneID)
paste('Number of genes exclusive to the RNAseq Gene List: ',nrow(compara.absent))

# How many genes have associated GO terms? ----
GO.present <- list.myTopHits.df[[yy]]$GO_term %>%
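  gsub("^NA$", NA, .) %>% # anchored so that only the literal string "NA" becomes missing; terms that merely contain "NA" are kept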
  as_tibble_col(column_name = "GO_Term") %>%
  tibble(geneID = list.myTopHits.df[[yy]]$geneID,.) %>%
  dplyr::filter(!is.na(GO_Term))
paste('Number of genes with an associated GO term: ',nrow(GO.present))

# Are any of these genes part of those not found in the compara dataset? ---- 
GO.present.Compara.absent <- dplyr::semi_join(GO.present, compara.absent, by = "geneID")
paste('Number of genes with GO terms that are not found in the Ensembl Compara List: ',nrow(GO.present.Compara.absent))

# Make a list of genes
ensComp.familyIDs <- read.xlsx ("../Data/Hunt_Parasite_Ensembl_Compara.xlsx", 
                                sheet = 2,
                                cols = c(1,4:6)) %>%
  as_tibble() %>%
  dplyr::mutate(Family_Description = dplyr::coalesce(.$Description, 
                                                     .$`Top.product.(members.with.hit)`, 
                                                     .$`Interpro.top.hit.(members.with.hit)`)
  ) %>%
  dplyr::select(Compara.family.id, Family_Description)

ensComp <- left_join(ensComp.geneIDs, ensComp.familyIDs, by = "Compara.family.id") %>%
  dplyr::select(-Compara.family.id) %>%
  dplyr::rename(gs_name = Family_Description, gene_symbol = geneID) %>%
  dplyr::relocate(gs_name, gene_symbol)

rm(ensComp.geneIDs, ensComp.familyIDs)

# Filter out genes that aren't part of our RNAseq dataset
genelist <- v.DEGList.filtered.norm$genes %>%
  rownames_to_column(var = "geneID") %>%
  dplyr::select(geneID)
ensComp<- ensComp %>%
  dplyr::rename(geneID = gene_symbol) %>%
  left_join(genelist, ., by = "geneID") %>%
  dplyr::relocate(gs_name, geneID)


# Generate rank ordered list of genes ----
mydata.df.sub <- dplyr::select(list.myTopHits.df[[yy]], geneID, logFC)
mydata.gsea <- mydata.df.sub$logFC
names(mydata.gsea) <- as.character(mydata.df.sub$geneID)
mydata.gsea <- sort(mydata.gsea, decreasing = TRUE)

# run GSEA using the 'GSEA' function from clusterProfiler
# Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
# There are three key elements of the GSEA method:
# **Calculation of an Enrichment Score.**
# The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter a gene not in S. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic (Subramanian et al. 2005).
# **Estimation of Significance Level of ES.**
# The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
# **Adjustment for Multiple Hypothesis Testing.**
# When all gene sets have been evaluated, DOSE adjusts the estimated significance levels to account for multiple hypothesis testing, and q-values are also calculated for FDR control.
myGSEA.res <- GSEA(mydata.gsea, TERM2GENE=ensComp, verbose=FALSE)
myGSEA.df <- as_tibble(myGSEA.res@result)
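
# Illustrative sketch of the running-sum enrichment score described above, on toy
# data. This is a simplified base-R version written for illustration; it is an
# assumption that it matches the spirit of the weighted Kolmogorov-Smirnov-like
# statistic, and the implementation used inside clusterProfiler may differ in detail.
# example_enrichment_score() is a hypothetical helper; it assumes the gene set
# contains at least one, but not all, of the ranked genes.
example_enrichment_score <- function(ranked, gene.set, exponent = 1) {
  ranked <- sort(ranked, decreasing = TRUE)
  in.set <- names(ranked) %in% gene.set
  hit.weight <- abs(ranked)^exponent * in.set
  step.hit <- hit.weight / sum(hit.weight)      # increments for genes in the set
  step.miss <- (!in.set) / sum(!in.set)         # decrements for genes outside the set
  running <- cumsum(step.hit - step.miss)
  running[[which.max(abs(running))]]            # maximum deviation from zero
}
# Usage: a set sitting at the top of the ranked list reaches the maximum score
# example_enrichment_score(c(a = 3, b = 2, c = 1, d = -1, e = -2, f = -3), c("a", "b"))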

myGSEA.tbl<-as_tibble(myGSEA.res@result) %>%
  dplyr::select(-c(Description, pvalue, enrichmentScore)) %>%
  dplyr::rename(normalized_EnrichmentScore = NES)

# view results as an interactive table
enrichment.DT <- datatable(myGSEA.tbl, 
                           rownames = TRUE,
                           caption =  htmltools::tags$caption(
                             style = 'caption-side: top; text-align: left; color: black',
                             htmltools::tags$b('Gene Families Enriched in ', 
                                               gsub('-',' vs ',
                                                    names(list.myTopHits.df)[[yy]]))
                           ),
                           options = list(
                             autoWidth = TRUE,
                             scrollX = TRUE,
                             #scrollY = '800px',
                             scrollCollapse = TRUE,
                             searchHighlight = TRUE, 
                             order = list(3, 'desc'),
                             pageLength = 25, 
                             lengthMenu = c("5",
                                            "10",
                                            "25",
                                            "50",
                                            "100"),
                             columnDefs = list(
                               list(targets = "_all",
                                    class="dt-right")))) %>%
  formatRound(columns=c(3,5:6), digits=2) %>%
  formatRound(columns=c(4), digits=4)
enrichment.DT

# create enrichment plots using the enrichplot package
# gseaplot2(myGSEA.res, 
#           geneSetID = 3, #can choose multiple signatures to overlay in this plot
#           pvalue_table = FALSE, #can set this to FALSE for a cleaner plot
#           title = "SCP/TAP Gene Set") #can also turn off this title

# add a variable to this result that matches enrichment direction with phenotype
myGSEA.df <- myGSEA.df %>%
  mutate(life_stage = case_when(
    NES > 0 ~ str_split(names(list.myTopHits.df)[[yy]],'-',simplify = T)[1,1],
    NES < 0 ~ str_split(names(list.myTopHits.df)[[yy]],'-',simplify = T)[1,2]))

myGSEA.df$ID <- myGSEA.df$ID %>%
  word(sep = ',') %>%
  #word(sep = '/') %>%
  word(sep = ' and')

# create 'bubble plot' to summarize y signatures across x phenotypes
ggplot(myGSEA.df, aes(x=life_stage, y=ID)) + 
  geom_point(aes(size=setSize, color = NES, alpha=-log10(p.adjust))) +
  scale_color_gradient(low="blue", high="red") +
  labs(title = paste0('S. venezuelensis: Gene Families Enriched in ', 
                      gsub('-',' vs ',
                           names(list.myTopHits.df)[[yy]])),
       subtitle = 'NES = Normalized Enrichment Score; Gene family assignments 
             from Ensembl Compara dataset defined in Hunt et al 2016',
       x = "Life Stage",
       y = "Family ID") +
  #coord_fixed(1/2) +
  theme_bw() +
  theme(plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.title = element_text(face = "bold",
                                  size = 13, hjust = 0),
        axis.title = element_text(face = "bold",size = 10.4),
        legend.title = element_text(face="bold",size = 10.4),
        aspect.ratio = 3/1)

# Load and Parse Preprocessed Data
load (file = "../Outputs/SvRNAseq_allSamples_data_preprocessed")
targets <- SvRNAseq.preprocessed.data$targets
annotations <- SvRNAseq.preprocessed.data$annotations
log2.cpm.filtered.norm <- SvRNAseq.preprocessed.data$log2.cpm.filtered.norm
myDGEList.filtered.norm <-SvRNAseq.preprocessed.data$myDGEList.filtered.norm

rm(SvRNAseq.preprocessed.data)

load(file = "../Outputs/Sv_allSamples_vDGEList")

# Introduction to this chunk -----------
# This code chunk starts with filtered and normalized abundance data in a data frame (not tidy).
# It will implement hierarchical clustering and PCA analyses on the data.
# It will plot various graphs and can save them in PDF files.
# Load packages ------
suppressPackageStartupMessages({
  library(tidyverse) # data wrangling and plotting tools
  library(ggplot2)
  library(RColorBrewer)
  library(ggdendro)
  library(magrittr)
  library(factoextra)
  library(gridExtra)
  library(cowplot)
  library(dendextend)
})

# Identify variables of interest in study design file ----
group <- factor(targets$group)
batch <- factor(targets$batch)
source <- factor(targets$source)

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# i.e. how similar the samples are to each other
colnames(log2.cpm.filtered.norm)<-targets$group
distance <- dist(t(log2.cpm.filtered.norm), method = "maximum") #other distance methods are "euclidean", "manhattan", "canberra", "binary" or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 
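
# Illustrative sketch on a toy matrix (genes in rows, samples in columns): with
# method = "maximum", the distance between two samples is the largest absolute
# difference across genes, so a single strongly divergent gene sets the distance.
# example_max_distance() is a hypothetical helper for illustration.
example_max_distance <- function(mat) {
  dist(t(mat), method = "maximum")  # transpose so that distances are between samples
}
# Usage: the per-gene differences are 3, 0 and 2, so the distance is 3
# example_max_distance(cbind(s1 = c(1, 5, 2), s2 = c(4, 5, 0)))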

p1<-dend %>% 
  dendextend::set("branches_k_color", k = 5) %>% 
  dendextend::set("hang_leaves", c(0.1)) %>% 
  dendextend::set("labels_cex", c(0.5)) %>%
  dendextend::set("labels_colors", k = 5) %>% 
  dendextend::set("branches_lwd", c(0.7)) %>% 
  
  as.ggdend %>%
  ggplot (offset_labels = -0.5) +
  theme_dendro() +
  ylim(0, max(get_branches_heights(dend))) +
  labs(title = "S. venezuelensis: Hierarchical Cluster Dendrogram ",
       subtitle = "filtered, TMM normalized, not batch corrected",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))
p1
# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(log2.cpm.filtered.norm), scale.=F, retx=T)

pc.var<-pca.res$sdev^2 # sdev^2 captures these eigenvalues from the PCA result
pc.per<-round(pc.var/sum(pc.var)*100, 1) # we can then use these eigenvalues to calculate the percentage variance explained by each PC
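
# Illustrative sketch on a toy matrix (genes in rows, samples in columns): the
# percentage of variance explained by each PC is its eigenvalue (sdev^2) divided by
# the sum of all eigenvalues. example_pc_percent() is a hypothetical helper that
# repeats the calculation above on any numeric matrix.
example_pc_percent <- function(mat) {
  pca <- prcomp(t(mat), scale. = FALSE, retx = TRUE)
  pc.var <- pca$sdev^2
  round(pc.var / sum(pc.var) * 100, 1)
}
# Usage: values are sorted from PC1 downwards and add up to roughly 100
# example_pc_percent(matrix(c(1, 2, 3, 4,  2, 1, 4, 3,  5, 5, 6, 8), nrow = 3, byrow = TRUE))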

# Visualize the PCA result ------------------
# let's first plot any two PCs against each other
# pca.res$x holds the coordinates (scores) of each sample on each PC, so let's plot those (the loadings are in pca.res$rotation)
pca.res.df <- as_tibble(pca.res$x)

# Plotting PC1 and PC2
p3<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2, label=targets$group, 
      fill = targets$group,
      color = targets$group
  ) +
  geom_point(size=4, shape= 21, color = "black", alpha = 0.5) +
  #geom_label(color = "black", size = 2) +
  #scale_fill_brewer(palette = "Set2") +
  #scale_color_brewer(palette = "Set2", guide = FALSE) +
  #stat_ellipse() +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="S. venezuelensis: Principal Components Analysis of RNAseq Samples",
       subtitle = "Not batch corrected, full dataset",
       caption = "Note: analysis is blind to life stage identity.",
       fill = "Life Stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw()
p3
# Introduction to this chunk -----------
# This code chunk starts with batch-corrected, filtered, TMM-normalized, voom-transformed abundance data (not tidy).
# It will implement hierarchical clustering and PCA analyses on the data.
# It will plot various graphs and can save them in PDF files.
# Load packages ------
suppressPackageStartupMessages({
  library(tidyverse) # data wrangling and plotting tools
  library(ggplot2)
  library(RColorBrewer)
  library(ggdendro)
  library(magrittr)
  library(factoextra)
  library(gridExtra)
  library(cowplot)
  library(dendextend)
})

# Identify variables of interest in study design file ----
group <- factor(v.DEGList.filtered.norm$targets$group)
source <- factor(v.DEGList.filtered.norm$targets$samples)
batch <- factor(v.DEGList.filtered.norm$design[,2])

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# i.e. how similar the samples are to each other
distance <- dist(t(v.DEGList.filtered.norm$E), method = "maximum") #other distance methods are "euclidean", "manhattan", "canberra", "binary" or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 

p1 <- dend %>%
  dendextend::set("branches_k_color", k = 5) %>%
  dendextend::set("hang_leaves", c(0.1)) %>%
  dendextend::set("labels_cex", c(0.5)) %>%
  dendextend::set("labels_colors", k = 5) %>%
  dendextend::set("branches_lwd", c(0.7)) %>%
  as.ggdend() %>%
  ggplot(offset_labels = -0.5) +
  theme_dendro() +
  ylim(0, max(get_branches_heights(dend))) +
  labs(title = "S. venezuelensis: Hierarchical Cluster Dendrogram",
       subtitle = "filtered, TMM normalized, batch-corrected",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))
p1
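
The clustering steps above (transpose, compute a distance matrix, then agglomerate) can be sketched on a small toy matrix. This is only an illustrative sketch: the matrix `toy.expr`, its values, and its sample names are hypothetical and are not taken from the S. venezuelensis data. It uses the same "maximum" distance and "complete" linkage as the chunk above, then cuts the tree into two groups with cutree().

```r
# Hypothetical toy expression matrix: 4 genes (rows) x 4 samples (columns).
toy.expr <- matrix(
  c(1, 1.2, 8, 8.1,
    2, 2.1, 9, 9.3,
    3, 2.9, 1, 1.2,
    4, 4.2, 2, 2.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4),
                  c("FLF_1", "FLF_2", "PF_1", "PF_2"))
)

# dist() compares rows, so transpose to compare samples rather than genes.
toy.dist <- dist(t(toy.expr), method = "maximum")

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members.
toy.clusters <- hclust(toy.dist, method = "complete")

# Cut the tree into two groups and inspect which samples fall together.
toy.membership <- cutree(toy.clusters, k = 2)
print(toy.membership)
```

In this toy example the two FLF columns differ by at most 0.2 and the two PF columns by at most 0.3, while FLF and PF columns differ by up to 7, so replicates of the same hypothetical life stage are expected to share a cluster.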
# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(v.DEGList.filtered.norm$E), scale. = FALSE, retx = TRUE)

pc.var <- pca.res$sdev^2 # sdev^2 gives the eigenvalues, i.e. the variance along each PC
pc.per <- round(pc.var / sum(pc.var) * 100, 1) # percentage of the total variance explained by each PC
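
As a rough check on the variance calculation above, the sketch below runs prcomp() on a small simulated matrix and confirms that sdev^2 matches the variance of each column of sample scores. The object names (`toy.expr`, `toy.pca`) and the simulated values are assumptions made for illustration only; they are not part of the analysis.

```r
set.seed(1)

# Hypothetical data: 50 genes (rows) x 6 samples (columns) of random values.
toy.expr <- matrix(rnorm(50 * 6), nrow = 50,
                   dimnames = list(paste0("gene", 1:50),
                                   paste0("sample", 1:6)))

# prcomp() treats rows as observations, so transpose to put samples in rows.
toy.pca <- prcomp(t(toy.expr), scale. = FALSE, retx = TRUE)

# Eigenvalues and the percentage of variance explained by each PC.
toy.var <- toy.pca$sdev^2
toy.per <- round(toy.var / sum(toy.var) * 100, 1)

# Scores: one row per sample, one column per PC.
dim(toy.pca$x)
# Loadings: one row per gene, one column per PC.
dim(toy.pca$rotation)

# The variance of each score column equals the corresponding eigenvalue.
all.equal(unname(apply(toy.pca$x, 2, var)), toy.var)
```

Because the percentages are rounded to one decimal place, they may not sum to exactly 100.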

# Visualize the PCA result ------------------
# Let's first plot any two PCs against each other.
# pca.res$x holds the scores (the coordinates of each sample on each PC); the gene loadings are in pca.res$rotation.
pca.res.df <- as_tibble(pca.res$x)

# Plotting PC1 and PC2
p3<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2, label=colnames(v.DEGList.filtered.norm$E), 
      fill = colnames(v.DEGList.filtered.norm$E),
      color = colnames(v.DEGList.filtered.norm$E)
  ) +
  geom_point(size=4, shape= 21, color = "black", alpha = 0.5) +
  #geom_label(color = "black", size = 2) +
  #scale_fill_brewer(palette = "Set2") +
  #scale_color_brewer(palette = "Set2", guide = FALSE) +
  #stat_ellipse() +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="S. venezuelensis: Principal Components Analysis of RNAseq Samples",
       subtitle = "Batch-corrected, full dataset",
       caption = "Note: analysis is blind to life stage identity.",
       fill = "Life Stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw()
p3
sessionInfo()

Appendix III: Session Info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] enrichplot_1.6.1       msigdbr_7.1.1          clusterProfiler_3.14.3
##  [4] gprofiler2_0.1.9       GSVA_1.34.0            GSEABase_1.48.0       
##  [7] graph_1.64.0           annotate_1.64.0        XML_3.99-0.3          
## [10] AnnotationDbi_1.48.0   IRanges_2.20.2         S4Vectors_0.24.4      
## [13] Biobase_2.46.0         BiocGenerics_0.32.0    gplots_3.0.4          
## [16] openxlsx_4.1.5         ggthemes_4.2.0         DT_0.14               
## [19] gt_0.2.1               edgeR_3.28.1           limma_3.42.2          
## [22] heatmaply_1.1.1        viridis_0.5.1          viridisLite_0.3.0     
## [25] plotly_4.9.2.9000      pheatmap_1.0.12        dendextend_1.13.4     
## [28] cowplot_1.0.0          gridExtra_2.3          factoextra_1.0.7      
## [31] magrittr_1.5           ggdendro_0.1-20        RColorBrewer_1.1-2    
## [34] forcats_0.5.0          stringr_1.4.0          dplyr_1.0.1           
## [37] purrr_0.3.4            readr_1.3.1            tidyr_1.1.1           
## [40] tibble_3.0.3           ggplot2_3.3.2          tidyverse_1.3.0       
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.1.0       RSQLite_2.2.0          htmlwidgets_1.5.1.9001
##   [4] grid_3.6.3             TSP_1.1-10             BiocParallel_1.20.1   
##   [7] munsell_0.5.0          codetools_0.2-16       withr_2.2.0           
##  [10] colorspace_1.4-1       GOSemSim_2.12.1        knitr_1.29            
##  [13] rstudioapi_0.11        ggsignif_0.6.0         DOSE_3.12.0           
##  [16] labeling_0.3           urltools_1.7.3         polyclip_1.10-0       
##  [19] bit64_0.9-7            farver_2.0.3           vctrs_0.3.2           
##  [22] generics_0.0.2         xfun_0.15              gclus_1.3.2           
##  [25] R6_2.4.1               graphlayouts_0.7.0     seriation_1.2-8       
##  [28] locfit_1.5-9.4         bitops_1.0-6           fgsea_1.12.0          
##  [31] gridGraphics_0.5-0     assertthat_0.2.1       promises_1.1.1        
##  [34] scales_1.1.1           ggraph_2.0.3           gtable_0.3.0          
##  [37] tidygraph_1.2.0        rlang_0.4.7            splines_3.6.3         
##  [40] rstatix_0.6.0          lazyeval_0.2.2         broom_0.5.6           
##  [43] europepmc_0.4          BiocManager_1.30.10    yaml_2.2.1            
##  [46] reshape2_1.4.4         abind_1.4-5            modelr_0.1.8          
##  [49] crosstalk_1.1.0.1      backports_1.1.8        httpuv_1.5.4          
##  [52] qvalue_2.18.0          tools_3.6.3            ggplotify_0.0.5       
##  [55] ellipsis_0.3.1         ggridges_0.5.2         Rcpp_1.0.5            
##  [58] plyr_1.8.6             progress_1.2.2         RCurl_1.98-1.2        
##  [61] prettyunits_1.1.1      ggpubr_0.4.0           haven_2.3.1           
##  [64] ggrepel_0.8.2          cluster_2.1.0          fs_1.4.2              
##  [67] data.table_1.12.8      DO.db_2.9              triebeard_0.3.0       
##  [70] reprex_0.3.0           hms_0.5.3              mime_0.9              
##  [73] evaluate_0.14          xtable_1.8-4           rio_0.5.16            
##  [76] readxl_1.3.1           compiler_3.6.3         KernSmooth_2.23-17    
##  [79] crayon_1.3.4           htmltools_0.5.0        mgcv_1.8-31           
##  [82] later_1.1.0.1          geneplotter_1.64.0     lubridate_1.7.9       
##  [85] DBI_1.1.0              tweenr_1.0.1           dbplyr_1.4.4          
##  [88] MASS_7.3-51.6          Matrix_1.2-18          car_3.0-8             
##  [91] cli_2.0.2              gdata_2.18.0           igraph_1.2.5          
##  [94] pkgconfig_2.0.3        rvcheck_0.1.8          registry_0.5-1        
##  [97] foreign_0.8-76         xml2_1.3.2             foreach_1.5.0         
## [100] webshot_0.5.2          rvest_0.3.5            digest_0.6.25         
## [103] rmarkdown_2.3          cellranger_1.1.0       fastmatch_1.1-0       
## [106] curl_4.3               shiny_1.5.0            gtools_3.8.2          
## [109] lifecycle_0.2.0        nlme_3.1-148           jsonlite_1.7.0        
## [112] carData_3.0-4          fansi_0.4.1            pillar_1.4.6          
## [115] lattice_0.20-41        fastmap_1.0.1          httr_1.4.2            
## [118] GO.db_3.10.0           glue_1.4.1             zip_2.0.4             
## [121] shinythemes_1.1.2      iterators_1.0.12       bit_1.1-15.2          
## [124] ggforce_0.3.2          stringi_1.4.6          blob_1.2.1            
## [127] caTools_1.18.0         memoise_1.1.0